Abstract

Many datasets such as market basket data, text or hypertext 
documents, and sensor observations recorded in different loca- 
tions or time periods, are modeled as a collection of sets over a 
ground set of keys. We are interested in basic aggregates such as 
the weight or selectivity of keys that satisfy some selection pred- 
icate defined over keys' attributes and membership in particular 
sets. This general formulation includes basic aggregates such as 
the Jaccard coefficient, Hamming distance, and association rules.

On massive data sets, exact computation can be inefficient or 
infeasible. Sketches based on coordinated random samples are 
classic summaries that support approximate query processing. Queries 
are resolved by generating a sketch (sample) of the union of sets
used in the predicate from the sketches of these sets and then applying
an estimator to this union-sketch.

We derive novel, tighter (unbiased) estimators that leverage sampled
keys that are present in the union of applicable sketches but
excluded from the union-sketch. We establish analytically that our
estimators dominate estimators applied to the union-sketch for all
queries and data sets. Empirical evaluation on synthetic and real
data reveals that on typical applications we can expect a reduction
in estimation error ranging from 25% to 4-fold.



1. Introduction 

We consider datasets modeled as a collection S of (possibly
intersecting) sets, defined over a ground set I of (possibly weighted)
keys. A classic example is a corpus of documents, where each document
is modeled as the set of features or terms present in it.

Basic aggregates over such data are weight and selectivity of 
subpopulations of keys. A query specifies a subpopulation of I
by a selection predicate. The weight aggregate is the sum of the
weights of the keys that satisfy the predicate. If keys have uniform
weights, the weight aggregate is known as the DV (distinct values)
count. An example of a weight query is the number of terms
that are present in both document A and document B and are at least
5 characters long. Selectivity queries are defined with respect to
some (sub) collection of sets: The result is the ratio of the sum 
of the weights of all keys in the union of these sets for which the 
predicate holds and the total weight of the union of these sets. An 
important selectivity aggregate is the Jaccard coefficient of A and 
B, defined as $|A \cap B| / |A \cup B|$, which measures the similarity
between A and B. A common technique to enhance this similarity
metric is to assign larger weights to features/terms that are less
frequent in the corpus. For weighted keys, the Jaccard coefficient
generalizes to $w(A \cap B) / w(A \cup B)$ (the ratio of the weight of the
intersection and the weight of the union).
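
As a minimal sketch of these definitions, exact weight and Jaccard aggregates over small in-memory sets could be computed as below. The helper names `weight` and `weighted_jaccard` and the toy term sets are illustrative assumptions and do not come from the paper.

```python
def weight(keys, w):
    """Sum of the weights of the given keys."""
    return sum(w[x] for x in keys)


def weighted_jaccard(a, b, w):
    """w(A ∩ B) / w(A ∪ B); defined here as 0.0 for an empty union."""
    union = a | b
    if not union:
        return 0.0
    return weight(a & b, w) / weight(union, w)


# Hypothetical documents represented as sets of terms.
A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date"}

# Uniform weights: the plain Jaccard coefficient.
uniform = {term: 1.0 for term in A | B}
jaccard_uniform = weighted_jaccard(A, B, uniform)

# Up-weight a term that is assumed to be rare in the corpus.
rare = dict(uniform, cherry=3.0)
jaccard_weighted = weighted_jaccard(A, B, rare)

# Weight query: terms in both A and B that are at least 5 characters long.
long_common = weight({t for t in A & B if len(t) >= 5}, uniform)
```

This exact computation touches every key of both sets, which is the cost that the sketches discussed next are meant to avoid.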

Basic (approximate) weight aggregates are also used to compute
more complex (approximate) aggregates, such as variance [15]
of a subpopulation of keys or the ratio of the weights of two
subpopulations of keys.

The selection predicates that specify subpopulations are de- 
fined using conditions on keys' attributes and on keys' member- 
ships in the different sets. We distinguish between attribute-based 
conditions, that are based on properties available through the iden- 
tifier of the key (length, origin, or frequency of a term, type of 
feature) and membership-based conditions that are based on the 
key's set memberships. For example, terms common to two doc- 
uments A, B are specified using the predicate with membership- 
based conditions "in A and in B". The predicate "in A and not in 
B and length > 5" has both attribute-based (length of a term) and 
membership-based conditions. 

We list additional datasets that fall in this framework. 

• Sensor nodes recording daily vehicle traffic in different 
locations in a city: Keys are distinct vehicles (license plate num- 
bers) and sets are location-date pairs (all vehicles observed at that 
location on that date). Example queries with membership-based conditions:
"number of distinct vehicles which operated in Manhattan
on election day, 2008" (size of the union of all Manhattan locations
on election day) or "number of distinct vehicles operated
in Tribeca on both Sunday and Monday on election week" (size
of intersection of the union of locations in Tribeca neighborhood
on Monday and Tuesday); "number of vehicles that crossed both
the Hudson and the East River on Independence Day 2008" (size
of intersection of the union of bridges/tunnels across the Hudson
and bridges/tunnels across the East River), etc. Queries can be restricted
to particular classes of vehicles (e.g., taxi cabs or heavy
trucks) by adding attribute-based conditions. Such queries can be
used for planning purposes.

• Market-basket dataset: Keys are goods, each with an as- 
sociated marketing cost (these are the weights). Each customer 
(basket) defines a set which is the set of goods she purchased. 
Example queries are "the total marketing cost of baby products 
purchased by male customers from Union county." This predi- 
cate has attribute-based condition (product type) and membership- 
based conditions (specification of the customer segment as a union 
of sets). 

• "Inverted" market-basket dataset: Keys are baskets (cus- 
tomers) and sets are goods (all baskets containing that particular 
good). A query that asks "what is the likelihood that a certain item 
is purchased given that another item is purchased" (this is an
"association rule" [1] [42]) can be expressed using a predicate with
membership-based conditions. If A is the set of customers purchasing,
say, beer, and B is the set of customers purchasing diapers,
then the selectivity of $A \cap B$ with respect to B is just the likelihood
that a person purchases beer given that she/he purchased diapers.
This query can be narrowed down to a particular customer segment
(e.g., by zip code or gender) if we add attribute-based conditions
to the predicate.

• Hyperlinked documents: Sets and keys are documents, where 
the set of document A includes all documents with hyperlinks to 
document A. Documents may be weighted by access data or page 
rank. Example queries are "the total weight of documents refer- 
encing at least 5 out of the 10 documents in Q." This predicate has 
membership-based conditions. 

• P2P network: Keys are files and sets are all neighborhoods 
of all peers (sets of files shared by peers in that neighborhood). 
Example queries are "the weight of files stored in the 5-hop neigh- 
borhoods of peer A or peer B," or "number of distinct files in a 
particular subject in the 3-hop neighborhood of peer A." Such 
queries can be used to keep the search focused on peers that contain
many keys in a particular topic or peers that are more similar
to the querying peer [14] [52] [55].

Exact computation of such queries requires retrieving the full 
content of all sets relevant to the predicate, computing the union, 
and applying the predicate to all keys in the union, adding up the 
weights of keys that satisfy the predicate. On massive or distributed
data, the high cost of exact computation prohibits running
a large number of queries (as is required for clustering or association
rule mining). In some cases, such as network traffic data,
the full data set may no longer be available at the time the query is 
formulated. 

The practical solution is to produce a summary that supports 
approximate processing of such queries. A suitable summary for- 
mat is a set of sketches, one for each set. 

A basic sketch format for a single set is a weighted random 
sample without replacement of the keys in the set, obtained using 
order (bottom-k) sampling [46] [12] [7] [42] [51] [17] [26] [19] [2]
[32]. The sample is obtained by assigning a random rank value to
each key and including the k keys with smallest rank values. The
rank distribution for a key depends on its weight. Using different
distributions, order samples can realize classic weighted sampling
[46] [33], where keys are successively drawn proportionally
to their weight, or priority sampling (sequential Poisson sampling)
[44] [45] [47] [26], which has estimators [26] that (nearly) minimize
the sum of per-key variances [53]. The sample of a set (with some
auxiliary rank information) supports tight unbiased estimators for
weight and selectivity aggregates over the set [19] [17] [26].

Multiple-set aggregates are estimated using the union-sketch
reduction to estimators over a single set. The reduction applies
when sketches of different sets are coordinated, that is, the same
set of rank values of keys is used across sets. It is known that
without coordination (independently sampling each set), it is not
possible to obtain strong estimators [9].

For a multiple-set aggregate and selection predicate with relevant
sets $S \subseteq \mathcal{A}$, a size-k sketch of the union $\bigcup_{A \in S} A$ is
constructed from size-k sketches of the sets in S [12] [7] [6]. A "single
set" weight or selectivity estimator can then be applied to estimate
our multiple-set aggregate, by applying it to the union sketch of
S.

Coordinated bottom-k sketches can be computed efficiently for
diverse data sources, centralized or distributed, with explicitly
or implicitly represented sets [17]. Sets are explicitly represented
when the data source can be modeled as a list of keys for
each set or a list of sets for each key (the inverted data). In the
former case, random hash functions are used to decouple the sampling
of different sets [7] [8] [23] [42] [2]. Examples of explicitly
represented sets include item-basket associations in market basket
data, links in web pages, and features in documents [7] [3] [42].

Sets are implicitly represented when memberships are specified
indirectly (as in our P2P example) through some metric on a set of
points. Implicit representation can be more concise than the corresponding
explicit representation. Keys are associated with points,
and sets are specified by a point and distance pair (neighborhood)
and include all keys within that distance from the point [12] [22]
[21] [41] [36] [16]. Examples are nodes in a graph with the shortest
path or reachability metric, the Euclidean plane, or time stamps or
sequence numbers on a data stream [12, 21, 16, 36]. When we are
interested in multiple distances (neighborhoods) of a point (applications
include aggregates with time or spatial decay [21] [16]),
all-distances sketches succinctly represent coordinated sketches of all
neighborhoods of the point and can be computed efficiently over
the implicit representation of the dataset [12] [16] [17].

Our contributions. Our main contribution is the derivation 
of tighter unbiased estimators for multiple-sets weight and selec- 
tivity aggregates. These estimators are applicable to a set of coor- 
dinated sketches and therefore apply to the same set of sketches as 
the union-sketch method. We will show that they involve similar 
computational tasks as the union-sketch method and they domi- 
nate all previous methods in terms of estimation quality. 
Combinations of sketches. A close look at the union-sketch ap- 
proach reveals that we discard potentially useful information present 
in the union of the size-k sketches of the sets by restricting our at- 
tention to the size-k sketch of the union of these sets. If there are 
t relevant sets, the union of the sketches includes at least k but up 
to t · k distinct keys. In Section [3] we consider two more inclusive
combinations of the sketches of the sets than the union-sketch:
the short combination of sketches (SCS) and the long combination
of sketches (LCS). The LCS includes all keys in the union of
the sketches. It contains the SCS, which in turn contains the k keys
in the sketch of the union.

Combination RC estimators for weight aggregates. In Section
[4] we develop unbiased estimators for subpopulation weight
that leverage the additional keys contained in the LCS and SCS.
Fully exploiting this additional information is a subtle and challenging
task: The SCS can be viewed as a variable-size sequential
sample of the union where the number of included keys depends
on set memberships of previously selected keys. The LCS cannot
be expressed as a sequential weighted sample of a set. The challenge
lies in benefiting from additional keys without introducing
bias: we cannot simply apply a single-sketch estimator to combinations.

We build on the powerful Rank Conditioning (RC) estimators
that are the best known estimators applicable to a sketch of
a single set [19] [26] (see footnote 1). Adjusted weights are assigned
to sampled keys and the weight estimate of a subpopulation is the sum
of the adjusted weights of sampled keys that belong to the subpopulation.
For multiple-set aggregates, our combination RC estimators
assign positive adjusted weights to all keys in the combination,
whereas the basic union-sketch method assigns them only to
k keys. We prove that our estimators are unbiased for every subpopulation
and furthermore, the covariance of the adjusted weights
of any two different keys is zero. This guarantees that the variance
of our estimate for a subpopulation is not larger than the sum of the
variances of the adjusted weights of the keys in the subpopulation.

We prove that (for any selection predicate and data set) the SCS
RC estimators are at least as tight (have at most the variance) as the
union-sketch RC estimators. Similarly to union-sketch estimators,
the SCS estimators are applicable to general select predicates. The
LCS RC estimators are at least as tight as the SCS RC estimators
but are applicable to a more limited class of predicates that are
attribute-based selections from a union of sets. Therefore, our SCS
RC estimator strictly dominates all union-sketch based estimators,
and for applicable select predicates, LCS RC dominates all other
methods. In Section [5] we demonstrate how the different estimators
are applied.

Coordinated Poisson samples. In Section [6] we consider coordinated
sketches based on Poisson samples. Poisson sampling (inclusion
probabilities of different keys are independent) has two disadvantages
relative to bottom-k sampling: the sample size is variable, and
coordinated Poisson samples cannot be computed in a scalable
way over implicitly represented sets. We consider estimators
for multiple-set aggregates that generalize ones proposed in [30] [31]
(for uniform weights) and discuss their relation to our bottom-k
estimators.

Getting more from combinations. Other estimators traditionally 
applied to the union-sketch can be extended to yield tighter results 
on combinations: 

o Unbiased selectivity. In Section [7] we derive unbiased estimators
for selectivity queries with respect to the union of the sets
in S. While selectivity can be estimated using the ratio of the estimated
weight of the set and the estimated weight of the union
$\bigcup_{A \in S} A$, this estimator might be biased even if we use unbiased
weight estimators. We derive SCS unbiased selectivity estimators
that strictly improve over traditional unbiased estimators for Jaccard
similarity [7] [6].

o Maximum Likelihood (ML). In Section [8] we derive ML estimators
applicable to combinations of bottom-k sketches based on
successive weighted sampling [46]. The derivation builds on ML
estimators [19], and as with other ML estimators, our new ones
are biased. We design tailored tighter estimators for applications
where the weight of each sketched set is readily available.

Empirical evaluation. Section [9] summarizes results of extensive
experiments on real and synthetic data. We quantify the power of
SCS and LCS-based estimators compared to estimators applied to
the union-sketch. Synthetic data was designed to study how performance
depends on the relations between the sets and on the
number of sets used in the predicate. Real data allowed us to
use natural selection predicates and demonstrate potential applications.
We discuss related work in Section [10] and conclude in
Section [11].

Footnote 1: There are tighter estimators when the exact total weight of the
set is known, but this is not the case in our multiple-set aggregates
since the weight of the union of sets cannot be exactly recovered
from sketches of the sets.

2. Preliminaries 

This section provides necessary background and definitions. 

A weighted set (I, w) consists of a set of keys I and a weight
function w assigning a weight $w(i) \ge 0$ to each key $i \in I$. A rank
assignment maps each key i to a random rank r(i). The ranks of keys are
drawn independently using a family of distributions $f_w$, where the
rank of a key with weight w(i) is drawn according to $f_{w(i)}$. For a
set J and a rank assignment r we denote by $r_i(J)$ the ith smallest
rank of a key in J; we also abbreviate and write $r(J) = r_1(J)$.
Random rank assignments are used to obtain sketches (samples
with some auxiliary information) of sets as follows.

The k-mins sketch [12] [7] of a set J is produced from k independent
rank assignments, $r^{(1)}, \ldots, r^{(k)}$. The sketch of a set
J is the k-vector $(r^{(1)}(J), r^{(2)}(J), \ldots, r^{(k)}(J))$. Depending on
the application, we may store with each of these ranks attributes
associated with the corresponding key.

A bottom-k sketch (or order sample) [46] [12] [47] [7] [42] [51] [17]
[19] [2] [32] of a set J is defined based on a single rank assignment
r as follows. Let $i_1, \ldots, i_k$ be the k keys of smallest ranks in
J. The sketch consists of k pairs $(r(i_j), w(i_j))$, $j = 1, \ldots, k$,
and $r_{k+1}(J)$. (If $|J| < k$ we store only $|J|$ pairs.) We denote a
bottom-k sketch of a set A with respect to a rank assignment r by
$s_k(A, r)$.
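
The definition above can be sketched in a few lines of Python. The function name `bottom_k_sketch` and the use of `math.inf` for $r_{k+1}$ when the set has at most k keys are assumptions made here for illustration; the rank values and weights are those listed for the keys of $A_1$ and $A_3$ in the example of Figure 1.

```python
import math


def bottom_k_sketch(keys, rank, weight, k):
    """Return (entries, r_k1): the k lowest-ranked keys and the (k+1)st rank.

    entries is a list of (key, rank, weight) triples in increasing rank order.
    r_k1 is math.inf when the set has at most k keys (a convention chosen here).
    """
    ordered = sorted(keys, key=lambda x: rank[x])
    entries = [(x, rank[x], weight[x]) for x in ordered[:k]]
    r_k1 = rank[ordered[k]] if len(ordered) > k else math.inf
    return entries, r_k1


# Rank values and weights of the keys that appear in A_1 and A_3 in Figure 1.
rank = {"i1": 0.487, "i3": 0.3, "i4": 0.208, "i5": 0.765,
        "i6": 0.599, "i7": 0.131, "i9": 0.73}
weight = {"i1": 1, "i3": 1, "i4": 3, "i5": 1, "i6": 1, "i7": 1, "i9": 1}

A1 = {"i1", "i3", "i5", "i7", "i9"}
A3 = {"i3", "i4", "i5", "i6", "i7"}

# Both sketches use the same rank dictionary, so they are coordinated.
sketch_A1, r4_A1 = bottom_k_sketch(A1, rank, weight, k=3)
sketch_A3, r4_A3 = bottom_k_sketch(A3, rank, weight, k=3)
```

Sorting the whole set is used here only for brevity; a bounded heap would avoid the full sort on large sets.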

Consider a set $\mathcal{A}$ of sets over a set of keys I. Coordinated k-mins
or bottom-k sketches are obtained by using the same rank
assignment over I (for k-mins sketches, the same set of rank assignments)
when producing the sketches of all sets $A \in \mathcal{A}$. Coordinated
sketches should include all rank values and keys' weights.
(If we are only interested in predicates with membership-based
conditions, then we do not have to include key attribute values in
the sketches.)

The union-sketch. Coordinated bottom-k and k-mins sketches
have the property that for a set $S \subseteq \mathcal{A}$ of sets we can compute
the sketch of $\bigcup_{A \in S} A$ from the sketches of the sets $A \in S$. For
k-mins sketches, the sketch of the union contains, for each rank
function, the key with minimum rank value across sets in S. For
bottom-k sketches, the keys in $s_k(\bigcup_{A \in S} A, r)$ are the keys with the k
smallest ranks in $\bigcup_{A \in S} s_k(A, r)$. Note that $r_{k+1}(\bigcup_{A \in S} A)$ is the
minimum rank of a key that is among the (k + 1) smallest ranks in
at least one of $A \in S$ but is not among the k smallest ranks in the
union sketch. Therefore, $r_{k+1}(\bigcup_{A \in S} A)$ can also be determined
from the sketches of $A \in S$.
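
The bottom-k case of this construction can be sketched as follows. The function name `union_sketch` and the dictionary-based sketch representation are assumptions of this illustration; the two input sketches are the bottom-3 sketches of $A_1$ and $A_3$ from the example of Figure 1.

```python
def union_sketch(sketches, k):
    """Combine coordinated bottom-k sketches into the bottom-k sketch of the union.

    sketches: list of (entries, r_k1) pairs, where entries maps key -> rank.
    Returns (keys of the union sketch in rank order, (k+1)st smallest rank).
    """
    merged = {}
    for entries, _ in sketches:
        merged.update(entries)  # coordination: a key has the same rank everywhere
    ordered = sorted(merged, key=merged.get)
    top = ordered[:k]
    # Candidates for the (k+1)st smallest rank in the union: sampled keys that
    # did not make the cut, plus each set's own (k+1)st smallest rank.
    candidates = [merged[x] for x in ordered[k:]]
    candidates += [r_k1 for _, r_k1 in sketches]
    return top, min(candidates)


# Bottom-3 sketches of A_1 and A_3 as listed in Figure 1.
s_A1 = ({"i7": 0.131, "i3": 0.3, "i1": 0.487}, 0.73)
s_A3 = ({"i7": 0.131, "i4": 0.208, "i3": 0.3}, 0.599)

top_keys, r4_union = union_sketch([s_A1, s_A3], k=3)
```

In this example the key `i1` is sampled in the sketch of $A_1$ but falls outside the union-sketch; this is exactly the kind of discarded sample that the combinations of Section 3 retain.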

The union-sketch reduction is a method that allows us to apply
a weight/selectivity estimator designed for attribute-based select
predicates over a single (k-mins or bottom-k) sketch to estimate
the weight/selectivity of a subpopulation specified by a general
select predicate (with membership and attribute based conditions)
over coordinated (k-mins or bottom-k) sketches of a collection of
sets S.

We first identify all sets S relevant to the predicate. We retrieve
the sketches of S and compute the sketch of the union. A very
handy property of the union-sketch is that for each key x included 
in the sketch of the union we can determine which sets of S it is 
a member of. We therefore can treat each membership in a set in 
S as an attribute of the keys. We then apply our single-sketch es- 
timator to the union-sketch of S, treating membership-based con- 
ditions of the predicate as attribute-based conditions over the keys 
in the union-sketch. 

As a concrete example, consider the inverted market-basket
data set and the query "the number of baskets of at most 10 items
that contain beer or wine and cheese." To answer it, we isolate the
sketches of beer, wine, and cheese, and compute the union sketch.
The union sketch is a random sample from the set of baskets that
have beer, wine, or cheese. For each basket in the union sketch we know
if it has or does not have each one of the three goods. The size of
the basket is an attribute. We can therefore identify all baskets in
the sample for which the predicate "has beer or has wine and has
cheese and has size ≤ 10" holds and estimate the distinct count.

WS sketches. The choice of which family of random rank functions
to use matters only when keys are weighted. Otherwise,
sketches produced using one rank function can be transformed to
any other rank function. Rank functions $f_w$ with some convenient
properties are exponential distributions with parameter w [46] [33]
[12]. The density function of this distribution is $f_w(x) = w e^{-wx}$,
and its cumulative distribution function is $F_w(x) = 1 - e^{-wx}$.
Equivalently, if $u \in U[0, 1]$ then $-\ln(u)/w$ is an exponential
random variable with parameter w. A useful property for designing
estimators [12] [16] [17] [19] is that the minimum rank
$r(J) = \min_{i \in J} r(i)$ of a key in a set $J \subseteq I$ is exponentially distributed
with parameter $w(J) = \sum_{i \in J} w(i)$ (the minimum of independent
exponentially distributed random variables is exponentially
distributed with parameter equal to the sum of the parameters of
these distributions).
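
For completeness, the minimum-rank property follows in one line from the independence of the ranks and the cumulative distribution function given above (a standard derivation, written out here as a reading aid):

```latex
\Pr[r(J) > x]
  = \prod_{i \in J} \Pr[r(i) > x]
  = \prod_{i \in J} e^{-w(i)\,x}
  = e^{-x \sum_{i \in J} w(i)}
  = e^{-w(J)\,x},
\qquad \text{hence} \qquad
\Pr[r(J) \le x] = 1 - e^{-w(J)\,x} = F_{w(J)}(x).
```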

Moreover, the probability that a key $x \in J$ is the minimum
rank key is w(x)/w(J). Hence, a k-mins sketch of a set J is
a weighted random sample of size k, drawn with replacement
from J. We call a k-mins sketch using exponential ranks a WSR
sketch. On the other hand, a bottom-k sketch of a set J with exponential
ranks corresponds to a weighted k-sample drawn without
replacement from J [46] [33]. We call such a sketch a WS sketch.
PRI sketches. If the rank value of a key with weight w is selected
uniformly at random from [0, 1/w] then the bottom-k sketch is
a priority sketch (also known as a sequential Poisson sample) [44]
[45] [47] [26]. This is equivalent to choosing rank value u/w,
where $u \in U[0, 1]$, or using density function $f_w(x) = w$ for
$0 \le x \le 1/w$ and $f_w(x) = 0$ otherwise, and cumulative distribution
$F_w(x) = \min\{1, wx\}$. Estimators for PRI sketches [26] have
(nearly) minimum sum of per-key variances $\sum_{i \in I} \mathrm{VAR}(a(i))$ [53].
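
One way to realize both rank families with coordination is to derive the uniform value u from a hash of the key, so that every set that contains the key sees the same rank. The sketch below is an assumption-laden illustration: the helper names and the choice of SHA-256 as the hash are ours, not the paper's.

```python
import hashlib
import math


def uniform_seed(key):
    """Map a key to a reproducible pseudo-random number in (0, 1]."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / (2**64 + 1)


def exp_rank(key, w):
    """Exponential rank with parameter w (WS sketches): -ln(u) / w."""
    return -math.log(uniform_seed(key)) / w


def pri_rank(key, w):
    """Priority rank (PRI sketches): u / w, which lies in (0, 1/w]."""
    return uniform_seed(key) / w
```

Because the rank depends only on the key and its weight, sketches of different sets can be produced independently (for example, on different machines) and still be coordinated.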
Adjusted weights. As mentioned in the introduction, one technique
to obtain estimators for the weights of keys is by assigning
an adjusted weight $a(i) \ge 0$ to each key i in the sample (adjusted
weight $a(i) = 0$ is implicitly assigned to keys not in the sample).
The adjusted weights are assigned such that E[a(i)] = w(i),
where the expectation is over the randomized algorithm choosing
the sample. Using adjusted weights we can estimate the weight
of any subpopulation $J \subseteq I$ by $\sum_{j \in J} a(j) = \sum_{j \in J \mid a(j) > 0} a(j)$.
The estimate is easily computed from the sample assuming we
have sufficient auxiliary information to tell for each key in the
sample whether it belongs to J or not. Moreover, for any numeric
function h() over keys' attributes such that h(i) > 0 only if
w(i) > 0 and any subpopulation J, $\sum_{j \in J \mid a(j) > 0} a(j) h(j) / w(j)$
is an unbiased estimate of $\sum_{j \in J} h(j)$.

Horvitz-Thompson (HT). Let $\Omega$ be the distribution over sketches.
If we know $p^{(\Omega)}(i) = \Pr\{i \in s \mid s \in \Omega\}$ for every $i \in s$ then we
can assign to $i \in s$ the adjusted weight

$a(i) = w(i) / p^{(\Omega)}(i)$.

Since a(i) is 0 when $i \notin s$, it is easy to see that E[a(i)] =
w(i). The estimator based on these adjusted weights is called the
Horvitz-Thompson (HT) estimator [35]. It is well known and easy
to see that these adjusted weights are unbiased and have minimal
variance for each key for the particular distribution $\Omega$ over rank
assignments.
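
The unbiasedness claim, and the resulting per-key variance, can be spelled out as follows. Here p abbreviates the inclusion probability of key i; this is the standard computation for Horvitz-Thompson weights, added as a reading aid.

```latex
E[a(i)] = p \cdot \frac{w(i)}{p} + (1 - p) \cdot 0 = w(i),
\qquad
\mathrm{VAR}[a(i)] = p \cdot \frac{w(i)^2}{p^2} - w(i)^2
                   = w(i)^2 \left( \frac{1}{p} - 1 \right).
```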

HT on a partitioned sample space (HTp). This is a method to derive
adjusted weights when we cannot determine $\Pr\{i \in s \mid s \in \Omega\}$
from the information contained in the sketch s alone. For example,
if s is a bottom-k sketch of (I, w), then $\Pr\{i \in s \mid s \in \Omega\}$
generally depends on all the weights w(i) for $i \in I$ and therefore
cannot be determined from s.

For each key i we consider a partition of $\Omega$ into equivalence
classes. For a sketch s, let $P^i(s) \subseteq \Omega$ be the equivalence class
of s. This partition must satisfy the following requirement: Given
s such that $i \in s$, we can compute the conditional probability
$p^i(s) = \Pr\{i \in s' \mid s' \in P^i(s)\}$ from the information included
in s.

We can therefore compute for all $i \in s$ the assignment
$a(i) = w(i)/p^i(s)$ (implicitly, a(i) = 0 for $i \notin s$). It is easy to see that
within each equivalence class, E[a(i)] = w(i). Therefore, also
over $\Omega$ we have E[a(i)] = w(i).

The variance of the adjusted weight a(i) obtained using HTp
depends on the particular partition in the following way. (This
follows from the convexity of the variance.)

LEMMA 2.1. [19] Consider two partitions of the sample space,
such that one partition is a refinement of the other. Then the variance
of a(i) using HTp with the coarser partition is at most that
of HTp with the finer partition.

Rank Conditioning (RC) adjusted weights. This is an HTp estimator
for a single bottom-k sketch [19]. The partition of $\Omega$ which
we use for assigning an adjusted weight to i is based on rank conditioning:
For each possible rank value r we have an equivalence
class $P^i_r$ containing all sketches in which the kth smallest rank
value assigned to a key other than i is r. Note that if $i \in s$ then
this is the (k + 1)st smallest rank, which is included in the sketch.
It is easy to see that the inclusion probability of i in a sketch in $P^i_r$
is $p^i_r = F_{w(i)}(r)$.

Assume s contains $i_1, \ldots, i_k$ and the (k + 1)st smallest rank
value $r_{k+1}$. Then for key $i_j$, we have $s \in P^{i_j}_{r_{k+1}}$ and
$a(i_j) = w(i_j) / F_{w(i_j)}(r_{k+1})$.
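
A minimal sketch of RC adjusted weights for a single bottom-k sketch, assuming the WS and PRI cumulative distribution functions given earlier in this section. The function names and the dictionary representation are illustrative assumptions; the data is the bottom-3 sketch of $A_1$ from the example of Figure 1, whose ranks are PRI ranks.

```python
import math


def inclusion_probability(w, tau, family):
    """F_w(tau): probability that a key of weight w gets a rank below tau."""
    if family == "ws":    # exponential ranks
        return 1.0 - math.exp(-w * tau)
    if family == "pri":   # priority ranks
        return min(1.0, w * tau)
    raise ValueError("unknown rank family: " + family)


def rc_adjusted_weights(entries, r_k1, family):
    """entries maps key -> weight for the k sampled keys; r_k1 is r_{k+1}."""
    return {x: w / inclusion_probability(w, r_k1, family)
            for x, w in entries.items()}


# Bottom-3 sketch of A_1 from Figure 1 (PRI ranks, unit weights, r_4 = 0.73).
adjusted = rc_adjusted_weights({"i7": 1.0, "i3": 1.0, "i1": 1.0}, 0.73, "pri")

# Estimate of w(A_1); the true value in the example is 5.
estimate_A1 = sum(adjusted.values())
```

A subpopulation estimate is obtained the same way by summing the adjusted weights of only those sampled keys that satisfy the predicate.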

3. Combinations of bottom-k sketches 

Consider a weighted set (I, w), a set S of subsets of I, a family of
rank functions $F_w$ (w > 0), and a set of coordinated bottom-k
sketches $s_k(A, r)$ for $A \in S$, where r is drawn according to $F_w$
(w > 0).



The short combination of sketches (SCS) of S, denoted $\mathrm{SCS}_k(S, r)$,
contains the prefixes of the sketches $s_k(A, r)$ ($A \in S$) that include
all keys with rank values smaller than $r_{k+1}(S) = \min_{A \in S} r_{k+1}(A)$.
The SCS also includes the value $r_{k+1}(S)$. The SCS contains between
k and $|S|k$ keys. Its size depends on the rank assignment.
Its expected size is larger when sets are of similar weights and have
fewer common keys.

The $\ell \ge k$ keys in the SCS are the $\ell$ least-ranked keys in the
union $\bigcup_{A \in S} A$, and $r_{k+1}(S) = r_{\ell+1}(\bigcup_{A \in S} A)$. Moreover, $\ell$ is
maximal for which we can identify the $\ell$ least-ranked keys in the
union from information available in the sketches of S. For WS
sketches, the SCS can be viewed as the outcome of weighted sampling
without replacement (ppswor) from the union of the sets in S
until we obtain k distinct samples from at least one of the sets in
S.

An important property of the SCS is that for every key x in
$\mathrm{SCS}_k(S, r)$ and a set $A \in S$ we can determine if $x \in A$: Indeed,
$x \in A$ if and only if x is in $s_k(A, r)$. The SCS is the maximal set
of keys that are included in the union of the sketches and have this
property.
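
The SCS construction and its membership property can be sketched as follows. The function name `short_combination` and the data layout are assumptions of this illustration; the four bottom-3 sketches are copied from the example of Figure 1.

```python
def short_combination(sketches):
    """Build the SCS from coordinated bottom-k sketches.

    sketches maps a set name to (entries, r_k1), where entries maps each
    sampled key to its rank. Returns (r_k1_S, scs), where scs maps each
    included key to (rank, names of the sets in S that contain the key).
    """
    r_k1_S = min(r_k1 for _, r_k1 in sketches.values())
    scs = {}
    for name, (entries, _) in sketches.items():
        for key, rank in entries.items():
            if rank < r_k1_S:
                # A key below r_{k+1}(S) is in A exactly when it is in s_k(A, r).
                scs.setdefault(key, (rank, set()))[1].add(name)
    return r_k1_S, scs


# Bottom-3 sketches of A_1, ..., A_4 as listed in Figure 1.
sketches = {
    "A1": ({"i7": 0.131, "i3": 0.3, "i1": 0.487}, 0.73),
    "A2": ({"i2": 0.36, "i10": 0.341, "i6": 0.599}, 0.73),
    "A3": ({"i7": 0.131, "i4": 0.208, "i3": 0.3}, 0.599),
    "A4": ({"i4": 0.208, "i2": 0.36, "i10": 0.341}, 0.599),
}

tau_all, scs_all = short_combination(sketches)
tau_12, scs_12 = short_combination({n: sketches[n] for n in ("A1", "A2")})
```

In both cases the SCS holds six keys, twice as many as the bottom-3 union-sketch, and each key carries its full membership information.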

The long combination of sketches (LCS) of S, denoted $\mathrm{LCS}_k(S, r)$,
includes all the information in the sketches $s_k(A, r)$, $A \in S$.

The LCS includes the SCS, but we do not have complete set-membership
information for all its keys. These definitions and
relations are illustrated in Figure [1] through a detailed example of
4 sets defined over a ground set of 10 keys. The example demonstrates
that the SCS and LCS contain more keys than the union-sketch.

In the sequel we derive estimators that reflect this relation- 
ship between combinations: SCS estimators are tighter than union- 
sketch estimators, reflecting the fact that the SCS contains the union- 
sketch. They are both applicable to arbitrary select predicates, re- 
flecting the full membership information that is available for each 
included key. LCS based estimators are tighter than SCS based es- 
timators, reflecting the fact that the LCS contains the SCS but LCS 
based estimators are more limited in that they are applicable only 
to restricted select predicates, reflecting the fact that we have less 
information for included keys. 

4. Combination RC Estimators 

We derive RC estimators for $\mathrm{SCS}_k(S, r)$ and $\mathrm{LCS}_k(S, r)$. Our
RC estimators assign adjusted weights that are positive for all keys
included in the respective combination (other keys are implicitly
assigned adjusted weight of zero), are unbiased for all keys in
$U = \bigcup_{A \in S} A$, and have zero covariances.

Let $p(w, \tau) = \lim_{x \to \tau^-} F_w(x)$ be the probability that a key
with weight w obtains a rank value that is smaller than $\tau$.

SCS RC adjusted weights $a^{(SCS)}(i)$:

• $r_{k+1}(S) \leftarrow \min_{A \in S} r_{k+1}(A)$.

• $\mathrm{SCS}_k(S, r) \leftarrow \{ i \in \bigcup_{A \in S} s_k(A, r) \mid r(i) < r_{k+1}(S) \}$.

• For all $i \in \mathrm{SCS}_k(S, r)$, assign the adjusted weight
$a^{(SCS)}(i) \leftarrow w(i) / p(w(i), r_{k+1}(S))$.

(For WS sketches, $a^{(SCS)}(i) = w(i)/(1 - \exp(-w(i) r_{k+1}(S)))$,
and for PRI sketches $a^{(SCS)}(i) = \max\{w(i), 1/r_{k+1}(S)\}$.) Figure
[2] demonstrates the computation of SCS RC adjusted weights
and of RC adjusted weights for the union sketch.
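
The three steps above, specialized to PRI sketches, can be sketched as follows. The function name and data layout are illustrative assumptions; the sketches are the bottom-3 sketches of the example in Figure 1, and the resulting values can be compared with those reported in Figure 2 after rounding to two decimals.

```python
def scs_rc_weights_pri(sketches):
    """SCS RC adjusted weights for PRI sketches: max{w(i), 1 / r_{k+1}(S)}.

    sketches is a list of (entries, r_k1) pairs; entries maps each sampled
    key to a (rank, weight) pair.
    """
    tau = min(r_k1 for _, r_k1 in sketches)        # r_{k+1}(S)
    adjusted = {}
    for entries, _ in sketches:
        for key, (rank, w) in entries.items():
            if rank < tau:                         # key belongs to the SCS
                adjusted[key] = max(w, 1.0 / tau)  # w / min{1, w * tau}
    return tau, adjusted


s_A1 = ({"i7": (0.131, 1), "i3": (0.3, 1), "i1": (0.487, 1)}, 0.73)
s_A2 = ({"i2": (0.36, 2), "i10": (0.341, 1), "i6": (0.599, 1)}, 0.73)
s_A3 = ({"i7": (0.131, 1), "i4": (0.208, 3), "i3": (0.3, 1)}, 0.599)
s_A4 = ({"i4": (0.208, 3), "i2": (0.36, 2), "i10": (0.341, 1)}, 0.599)

tau_12, a_12 = scs_rc_weights_pri([s_A1, s_A2])
tau_all, a_all = scs_rc_weights_pri([s_A1, s_A2, s_A3, s_A4])

# Estimated weight of the union of A_1 and A_2: sum of all adjusted weights.
union_weight_estimate = sum(a_12.values())
```

Any predicate over attributes and memberships is then estimated by restricting the sum to the keys of the SCS that satisfy it.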



• Union-sketch RC adjusted weights for $i_j \in s_3(\bigcup_{A \in S} A, r)$:
  $\tau = r_4(\bigcup_{A \in S} A)$, $a^{(union)}(i_j) = w_j / p(w_j, \tau)$

  S                       τ       p(w, τ)            a^(union)(i_j)
  {A_1, A_2}              0.341   min{0.341 w, 1}    max{w_j, 2.93}
  {A_1, A_2, A_3, A_4}    0.3     min{0.3 w, 1}      max{w_j, 3.33}

• SCS RC adjusted weights for $i_j \in \mathrm{SCS}_3(S, r)$:
  $\tau = r_4(S)$, $a^{(SCS)}(i_j) = w_j / p(w_j, \tau)$

  S                       r_4(S)  p(w, r_4(S))       a^(SCS)(i_j)
  {A_1, A_2}              0.73    min{0.73 w, 1}     max{w_j, 1.37}
  {A_1, A_2, A_3, A_4}    0.599   min{0.599 w, 1}    max{w_j, 1.67}

• LCS RC adjusted weights for $i_j \in \mathrm{LCS}_3(S, r)$, $S = \{A_1, A_2, A_3, A_4\}$.
  Sets sorted by increasing $r_4(A_i)$: $A_3, A_4, A_1, A_2$

  i_j             i_7    i_4    i_2    i_3    i_10   i_1    i_6
  f(S, r, i_j)    1      4      2      1      2      1      2
  τ(S, r, i_j)    0.73   0.599  0.73   0.73   0.73   0.73   0.73
  a^(LCS)(i_j)    1.37   3      2      1.37   1.37   1.37   1.37

Union-sketch, SCS/LCS RC adjusted weights for $S = \{A_1, A_2\}$:

  key        i_1   i_2   i_3   i_4   i_5   i_6   i_7   i_8   i_9   i_10
  w_j        1     2     1     3     1     1     1     1     1     1
  union      0     2.93  2.93  0     0     0     2.93  0     0     0
  SCS/LCS    1.37  2     1.37  0     0     1.37  1.37  0     0     1.37

Union-sketch, SCS, and LCS RC adjusted weights for $S = \{A_1, A_2, A_3, A_4\}$:

  key        i_1   i_2   i_3   i_4   i_5   i_6   i_7   i_8   i_9   i_10
  w_j        1     2     1     3     1     1     1     1     1     1
  union      0     3.33  0     3.33  0     0     3.33  0     0     0
  SCS        1.67  2     1.67  3     0     0     1.67  0     0     1.67
  LCS        1.37  2     1.37  3     0     1.37  1.37  0     0     1.37

Figure 2. Upper box: Adjusted weights computation for the example
in Figure [1]. SCS and LCS adjusted weights for $S = \{A_1, A_2\}$
are equal since $r_4(A_1) = r_4(A_2) = r_4(\{A_1, A_2\})$. Lower two
tables: RC adjusted weights computed using the union-sketch, the
SCS, and the LCS.



We show that $a^{(SCS)}$ are unbiased:

LEMMA 4.1. For all $i \in U$, $E[a^{(SCS)}(i)] = w(i)$.

PROOF. We apply HTp. For a key i we partition the space of
all rank assignments according to the rank values assigned to the
keys $U \setminus \{i\}$. Consider a subspace R in this partition. Fix some
$r \in R$ and let

$\tau(R) = \min\{ \min\{r_k(A \setminus \{i\}) \mid A \in S, i \in A\}, \min\{r_{k+1}(A) \mid A \in S, i \notin A\} \}$.

Clearly $\tau(R)$ is independent of the choice of $r \in R$.

For $r \in R$, the key i is included in $\mathrm{SCS}_k(S, r)$ if and only if
$r(i) < \tau(R)$, which happens with probability $p(w(i), \tau(R))$. If
indeed i is included in $\mathrm{SCS}_k(S, r)$ then $r_{k+1}(S) = \tau(R)$. □
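
The lemma can be sanity-checked numerically. The following Monte Carlo sketch, which is an illustration and not a substitute for the proof, draws fresh PRI rank assignments, computes the SCS RC estimate of the total weight of the union, and averages over many trials. The function name, the seed, and the number of trials are arbitrary choices made here; the sets and weights follow the listing in Figure 1.

```python
import random


def scs_pri_estimate(sets, weights, k, rng):
    """One SCS RC estimate of the weight of the union, using PRI ranks."""
    # Coordinated ranks: one value u / w per key, shared by all sets.
    rank = {x: (1.0 - rng.random()) / w for x, w in weights.items()}
    tau = float("inf")
    sampled = set()
    for members in sets:
        ordered = sorted(members, key=rank.get)
        sampled.update(ordered[:k])                # bottom-k sketch of the set
        if len(ordered) > k:
            tau = min(tau, rank[ordered[k]])       # r_{k+1}(S)
    # Keys of the SCS get adjusted weight max{w, 1 / tau}; others get 0.
    return sum(max(weights[x], 1.0 / tau) for x in sampled if rank[x] < tau)


weights = {"i1": 1, "i2": 2, "i3": 1, "i4": 3, "i5": 1,
           "i6": 1, "i7": 1, "i8": 1, "i9": 1, "i10": 1}
sets = [
    {"i1", "i3", "i5", "i7", "i9"},
    {"i1", "i2", "i5", "i6", "i9", "i10"},
    {"i3", "i4", "i5", "i6", "i7"},
    {"i2", "i4", "i6", "i8", "i10"},
]

rng = random.Random(2024)
trials = 20000
mean_estimate = sum(scs_pri_estimate(sets, weights, 3, rng)
                    for _ in range(trials)) / trials
true_weight = sum(weights.values())  # the four sets cover all ten keys
```

With this many trials the empirical mean should land close to the true union weight, as unbiasedness predicts.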

We show that $a^{(SCS)}$ have zero covariances:

LEMMA 4.2. For $i, j \in U$, $i \ne j$, $\mathrm{COV}[a^{(SCS)}(i), a^{(SCS)}(j)] = 0$.

PROOF. We partition the space of rank assignments and show
that in each set of the partition, E[a(i)a(j)] = w(i)w(j). The
partition is according to the rank values assigned to all keys in
$U \setminus \{i, j\}$. Let R be a subspace in the partition, and let r be a rank
assignment in R. Define

$\tau(R) = \min\{ \min\{r_{k-1}(A \setminus \{i, j\}) \mid A \in S, i, j \in A\},$
$\min\{r_k(A \setminus \{i\}) \mid A \in S, i \in A, j \notin A\},$
$\min\{r_k(A \setminus \{j\}) \mid A \in S, j \in A, i \notin A\},$
$\min\{r_{k+1}(A) \mid A \in S, i, j \notin A\} \}$.



• Keys:

key            i_1    i_2    i_3    i_4    i_5    i_6    i_7    i_8    i_9    i_10
w_j            1      2      1      3      1      1      1      1      1      1
u_j            0.487  0.72   0.3    0.832  0.765  0.599  0.131  0.886  0.73   0.341
r_j = u_j/w_j  0.487  0.36   0.3    0.208  0.765  0.599  0.131  0.886  0.73   0.341



• Sets:

A_1 = {i_1, i_3, i_5, i_7, i_9}
A_2 = {i_1, i_2, i_5, i_6, i_9, i_10}
A_3 = {i_3, i_4, i_5, i_6, i_7}
A_4 = {i_2, i_4, i_6, i_8, i_10}



• Keys sorted by increasing ranks (with matrix showing set memberships of all keys):

keys  i_7      i_4      i_2      i_3      i_10     i_1      i_6      i_9      i_5      i_8
rank  0.131    0.208    0.36     0.3      0.341    0.487    0.599    0.73     0.765    0.886
A_1   ✓        ×        ×        ✓        ×        ✓        ×        ✓        ✓        ×
A_2   ×        ×        ✓        ×        ✓        ×        ✓        ✓        ✓        ×
A_3   ✓        ✓        ×        ✓        ×        ×        ✓        ×        ✓        ×
A_4   ×        ✓        ✓        ×        ✓        ×        ✓        ×        ×        ✓



• Table showing bottom-3 sketches s_3(A_i, r) (the 3 least-ranked keys of A_i) and r_4(A_i) (the 4th smallest rank value):

A      s_3(A, r)                               r_4(A)
A_1    i_7 (0.131), i_3 (0.3), i_1 (0.487)     0.73
A_2    i_2 (0.36), i_10 (0.341), i_6 (0.599)   0.73
A_3    i_7 (0.131), i_4 (0.208), i_3 (0.3)     0.599
A_4    i_4 (0.208), i_2 (0.36), i_10 (0.341)   0.599



• For S = {A_1, A_2} and S = {A_1, A_2, A_3, A_4}, keys included in:
o s_3(∪_{A∈S} A, r) (the union-sketch of S),
o SCS_3(S, r) (contains all keys in ∪_{A∈S} s_3(A, r) that have rank value below r_4(S) = min_{A∈S} r_4(A)), and
o LCS_3(S, r) (contains all keys in ∪_{A∈S} s_3(A, r)).

S                    combination type    content                                                                                    # keys
A_1, A_2             s_3(∪_{A∈S} A, r)   i_7 (0.131), i_2 (0.36), i_3 (0.3)                                                         3
A_1, A_2             SCS_3(S, r)         i_7 (0.131), i_2 (0.36), i_3 (0.3), i_10 (0.341), i_1 (0.487), i_6 (0.599)                 6
A_1, A_2             LCS_3(S, r)         i_7 (0.131), i_2 (0.36), i_3 (0.3), i_10 (0.341), i_1 (0.487), i_6 (0.599)                 6
A_1, A_2, A_3, A_4   s_3(∪_{A∈S} A, r)   i_7 (0.131), i_4 (0.208), i_2 (0.36)                                                       3
A_1, A_2, A_3, A_4   SCS_3(S, r)         i_7 (0.131), i_4 (0.208), i_2 (0.36), i_3 (0.3), i_10 (0.341), i_1 (0.487)                 6
A_1, A_2, A_3, A_4   LCS_3(S, r)         i_7 (0.131), i_4 (0.208), i_2 (0.36), i_3 (0.3), i_10 (0.341), i_1 (0.487), i_6 (0.599)    7



Figure 1. Example shows a set $I$ of 10 keys $i_1, \ldots, i_{10}$ with respective weights $w_1, \ldots, w_{10}$ and 4 subsets $A_1, \ldots, A_4$; a random rank
assignment $r$ for $I$, using priority ranks (for each key $i_j$, draw $u_j \in U[0,1]$ and compute rank value $r_j = u_j/w_j$); bottom-3 sketches $s_3(A_i, r)$
for $i = 1, \ldots, 4$; for $S = \{A_1, A_2\}$ and $S = \{A_1, A_2, A_3, A_4\}$, keys included in the union-sketch, the SCS, and the LCS of $S$.



Clearly $\tau(R)$ is independent of the choice of $r \in R$. For $r \in R$,
it is easy to see that $i$ and $j$ are both included in the SCS if and
only if $r(i) < \tau(R)$ and $r(j) < \tau(R)$, which happens with
probability $p(w(i), \tau(R))\,p(w(j), \tau(R))$. Otherwise either $i$ or $j$ is not
included in the SCS and $a(i)a(j) = 0$. In the case they are both
included, $r_{k+1}(\mathcal{S}) = \tau(R)$, and therefore they are assigned
adjusted weights of $w(i)/p(w(i), \tau(R))$ and $w(j)/p(w(j), \tau(R))$,
respectively. It follows that

\[ E[a(i)a(j)] = p(w(i), \tau(R))\,p(w(j), \tau(R)) \cdot \frac{w(i)\,w(j)}{p(w(i), \tau(R))\,p(w(j), \tau(R))} = w(i)\,w(j). \]

□



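LCS RC adjusted weights $a^{(LCS)}(i)$:

• Sort the sets $A \in \mathcal{S}$ by increasing $r_{k+1}(A)$ into the ordered set
$A_1, A_2, \ldots, A_{|\mathcal{S}|}$ ($r_{k+1}(A_i) \le r_{k+1}(A_j)$ if $i < j$).

• For all $i \in LCS_k(\mathcal{S}, r)$:

\[ \ell(\mathcal{S}, r, i) \leftarrow \arg\max_h \{\, i \in s_k(A_h, r) \,\}, \qquad
\tau(\mathcal{S}, r, i) \leftarrow r_{k+1}(A_{\ell(\mathcal{S}, r, i)}), \]

\[ a^{(LCS)}(i) = \frac{w(i)}{p(w(i), \tau(\mathcal{S}, r, i))}. \qquad (2) \]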



Figure 2 demonstrates the computation of RC LCS adjusted weights.
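
The SCS and LCS assignments can be compared on the Figure 1 example with a minimal Python sketch. It assumes priority ranks, so that $p(w, \tau) = \min\{1, w\tau\}$, and takes the bottom-3 sketches and $r_4$ values as listed in Figure 1. The names `incl_prob`, `scs_weights`, and `lcs_weights` are illustrative and do not come from the paper.

```python
def incl_prob(w, tau):
    """Inclusion probability p(w, tau), assuming priority ranks r = u / w."""
    return min(1.0, w * tau)

def scs_weights(sketches, weight, rank):
    """SCS: keep sampled keys whose rank is below the smallest r_{k+1}(A)."""
    tau = min(t for _, t in sketches)
    keys = {i for ks, _ in sketches for i in ks if rank[i] < tau}
    return {i: weight[i] / incl_prob(weight[i], tau) for i in keys}

def lcs_weights(sketches, weight):
    """LCS: each sampled key uses the largest r_{k+1}(A) over sketches holding it."""
    tau = {}
    for ks, t in sketches:
        for i in ks:
            tau[i] = max(tau.get(i, 0.0), t)
    return {i: weight[i] / incl_prob(weight[i], t) for i, t in tau.items()}

# Figure 1 data; keys are named by their index j in i_j.
weight = {1: 1, 2: 2, 3: 1, 4: 3, 6: 1, 7: 1, 10: 1}
rank = {7: 0.131, 4: 0.208, 2: 0.36, 3: 0.3, 10: 0.341, 1: 0.487, 6: 0.599}
sketches = [  # (bottom-3 keys, r_4) for A_1, A_2, A_3, A_4
    ([7, 3, 1], 0.73), ([2, 10, 6], 0.73),
    ([7, 4, 3], 0.599), ([4, 2, 10], 0.599),
]

scs = scs_weights(sketches, weight, rank)
lcs = lcs_weights(sketches, weight)
print(sorted(scs), round(scs[7], 2))  # [1, 2, 3, 4, 7, 10] 1.67
print(sorted(lcs), round(lcs[7], 2))  # [1, 2, 3, 4, 6, 7, 10] 1.37
```

Taking the maximum of $r_{k+1}(A)$ over the sketches that contain a key is equivalent to the arg max over the sorted order used in the definition, which is why no explicit sort appears in the sketch.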

LEMMA 4.3. For all $i \in U$, $E[a^{(LCS)}(i)] = w(i)$.

PROOF. For a key $i \in U$, we partition the space of all rank
assignments according to the rank values of keys in $U \setminus \{i\}$. Consider
a subspace $R$ in this partition, let $r$ be a rank assignment in $R$, and
let $\tau(R) = \max_{A \in \mathcal{S} \mid i \in A} r_k(A \setminus \{i\})$, which is independent of
the choice of $r \in R$.

For $r \in R$, the key $i$ is included in $LCS_k(\mathcal{S}, r)$ if and only if
$r(i) < \tau(R)$. This happens with probability $p(w(i), \tau(R))$ and
when it happens we clearly have that $r_{k+1}(A_{\ell(\mathcal{S}, r, i)}) = \tau(R)$,
which implies the lemma. □





     | condition                                 | relevant sets S    | keys                      | weight | RC union | RC SCS | RC LCS | best comb
P_1  | i_j ∈ ∪_{i∈[2]} A_i ∧ (j < 8 ∧ j > 4)     | A_1, A_2           | i_5, i_6, i_7             | 3      | 2.93     | 2.74   | 2.74   | LCS_3
P_2  | i_j ∈ ∩_{i∈[2]} A_i ∧ (j < 8 ∧ j > 4)     | A_1, A_2           | i_5                       | 1      |          | 1.37   |        | SCS_3
P_3  | i_j ∈ at least two out of A_1, ..., A_4   | A_1, A_2, A_3, A_4 | i_1, ..., i_7, i_9, i_10  | 12     | 10       | 11.68  |        | SCS_3
P_4  | i_j ∈ ∪_{i∈[4]} A_i ∧ j is odd            | A_1, A_2, A_3, A_4 | i_1, i_3, i_5, i_7, i_9   | 5      | 3.33     | 5      | 4.1    | LCS_3



Figure 3. Example predicates for the dataset in Figure 1. Table shows, for each predicate P, a minimum set of relevant sets, all keys that satisfy
P(i), weight of these keys, best applicable combination, and RC union, RC SCS, and RC LCS estimates, based on the adjusted weights computation in
Figure 2 (LCS adjusted weight is not shown for predicates where LCS is not applicable).



LEMMA 4.4. For $i, j \in U$, $i \ne j$, $\mathrm{COV}[a^{(LCS)}(i), a^{(LCS)}(j)] = 0$.

PROOF. Consider the subspace where all ranks of keys other
than $i$ and $j$ are fixed. We compute $E[a(i)a(j)]$ in this subspace.

Let $\mathcal{S}_i$ be the collection of sets in $\mathcal{S}$ that contain key $i$ and do
not contain key $j$. Let $\mathcal{S}_j$ be the collection of sets in $\mathcal{S}$ that contain
key $j$ and do not contain key $i$. Finally, let $\mathcal{S}_{i,j}$ be the collection of
sets in $\mathcal{S}$ containing both $i$ and $j$.

Define
\[ \tau^{-i} = \max\bigl( \{ r_k(A \setminus \{i\}) \mid A \in \mathcal{S}_i \} \cup \{ r_{k-1}(A \setminus \{i,j\}) \mid A \in \mathcal{S}_{i,j} \} \bigr), \]
\[ \tau^{-j} = \max\bigl( \{ r_k(A \setminus \{j\}) \mid A \in \mathcal{S}_j \} \cup \{ r_{k-1}(A \setminus \{i,j\}) \mid A \in \mathcal{S}_{i,j} \} \bigr), \]
\[ \tau^{-i,j} = \max\bigl( \{ r_k(A \setminus \{i,j\}) \mid A \in \mathcal{S}_{i,j} \} \bigr). \]

We split into cases according to the relations between $\tau^{-i}$, $\tau^{-j}$,
and $\tau^{-i,j}$. If $\tau^{-i,j} < \min\{\tau^{-i}, \tau^{-j}\}$ or if $\max\{\tau^{-i}, \tau^{-j}\} < \tau^{-i,j}$,
then $i$ and $j$ are both included (and $a(i)a(j) > 0$) if and
only if $r(i) < \tau^{-i}$ and $r(j) < \tau^{-j}$, in which case
$a(i) = w(i)/p(w(i), \tau^{-i})$ and $a(j) = w(j)/p(w(j), \tau^{-j})$. Therefore, under this
conditioning,

\[ E[a(i)a(j)] = p(w(i), \tau^{-i})\, p(w(j), \tau^{-j})\, \frac{w(i)}{p(w(i), \tau^{-i})}\, \frac{w(j)}{p(w(j), \tau^{-j})} = w(i)\,w(j). \]

The remaining case is $\tau^{-i} < \tau^{-i,j} < \tau^{-j}$ (the case $\tau^{-j} < \tau^{-i,j} < \tau^{-i}$
is symmetric). Key $j$ is included if and only if $r(j) < \tau^{-j}$,
in which case $a(j) = w(j)/p(w(j), \tau^{-j})$. The inclusion condition
and adjusted weight of $i$ if included depend on $r(j)$, but if we
fix $r(j)$, from the proof of Lemma 4.3, $E[a(i)] = w(i)$. That
is, if $a(i \mid y, x)$ denotes the adjusted weight of $i$ if $r(j) = x$ and
$r(i) = y$, then for all $x$, $\int_0^\infty f_{w(i)}(y)\, a(i \mid y, x)\, dy = w(i)$. Therefore,

\[ E[a(i)a(j)] = \int_0^{\tau^{-j}} f_{w(j)}(x)\, \frac{w(j)}{p(w(j), \tau^{-j})} \int_0^\infty f_{w(i)}(y)\, a(i \mid y, x)\, dy\, dx \]
\[ = \int_0^{\tau^{-j}} f_{w(j)}(x)\, dx \cdot \frac{w(j)}{p(w(j), \tau^{-j})}\, w(i) \]
\[ = p(w(j), \tau^{-j})\, \frac{w(j)}{p(w(j), \tau^{-j})}\, w(i) = w(j)\, w(i). \]

□

Consider the set $\mathcal{S}$ of subsets of $I$, a family of rank functions,
and coordinated bottom-k sketches $s_k(A, r)$ for $A \in \mathcal{S}$. We
compare the three RC adjusted weight assignments $a^{(C)}(i)$ ($i \in U$),
where $C$ is

• union: single-sketch RC adjusted weights on the sketch of
the union $s_k(U, r)$

• SCS: SCS RC adjusted weights on $SCS_k(\mathcal{S}, r)$

• LCS: LCS RC adjusted weights on $LCS_k(\mathcal{S}, r)$



LEMMA 4.5. For any $J \subseteq U$,

\[ \mathrm{VAR}[a^{(LCS)}(J)] \le \mathrm{VAR}[a^{(SCS)}(J)] \le \mathrm{VAR}[a^{(union)}(J)]. \]

PROOF. Because all methods have zero covariances between
different keys, it suffices to establish the relation for the variances
of per-key adjusted weights, that is, for any $i \in U$,
$\mathrm{VAR}[a^{(LCS)}(i)] \le \mathrm{VAR}[a^{(SCS)}(i)] \le \mathrm{VAR}[a^{(union)}(i)]$.

Consider a key $i$ and a subspace $R$ of the sample space of rank
assignments such that the rank values of all other keys are fixed. It
suffices to show the variance relation in each such subspace.

Let $q^{(union)}(R, i)$, $q^{(SCS)}(R, i)$, $q^{(LCS)}(R, i)$ be the probabilities
conditioned on $R$ that $i$ is included in the respective combination.
Since the probability $p(w(i), \tau)$ is nondecreasing in $\tau$ and
$r_{k+1}(\bigcup_{A \in \mathcal{S}} A) \le r_{k+1}(\mathcal{S}) \le \tau(\mathcal{S}, r, i)$, we have that
$q^{(union)}(R, i) \le q^{(SCS)}(R, i) \le q^{(LCS)}(R, i)$.

For any combination $C \in \{\text{union}, \text{SCS}, \text{LCS}\}$ the adjusted weight
in $R$ is the HT estimator $a^{(C)}(i) = w(i)/q^{(C)}(R, i)$. The variance
of $a^{(C)}(i)$ is decreasing with the probability $q^{(C)}(R, i)$, which
concludes the proof. □
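
The last step of the proof can be made explicit with a short worked equation (a sketch under the proof's conditioning on $R$, writing $q$ for $q^{(C)}(R, i)$). Within $R$, the adjusted weight equals $w(i)/q$ with probability $q$ and $0$ otherwise, so

\[ \mathrm{VAR}[a^{(C)}(i) \mid R] = q \left( \frac{w(i)}{q} \right)^2 - w(i)^2 = w(i)^2 \left( \frac{1}{q} - 1 \right), \]

which is decreasing in $q$ on $(0, 1]$. Since the conditional expectation is $w(i)$ in every subspace $R$, the unconditional variance is the expectation of these conditional variances, and the ordering of the inclusion probabilities carries over to the variances.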

5. Computing Estimates 

The input to our estimation procedure is a set of coordinated
bottom-k sketches $s_k(A, r)$ for sets $A \subseteq I$, $A \in \mathcal{A}$, and a weight
query specified by a predicate $P$ over the keys in $I$. The desired output is an
estimate of $\sum_{i \in I \mid P(i)} w(i)$.

We use the following two definitions:

• A set of relevant sets $\mathcal{S} \subseteq \mathcal{A}$ for a predicate $P$ is a set of sets
that suffices to determine the keys that satisfy $P$. For example,
for the query "term is present in at least 2 out of books A, B, C,"
the relevant sets are A, B, and C. The query "term present in A
and not in C" has relevant sets A and C. In both cases, these are
minimum relevant sets. The first step of processing the query for
$P$ is determining a (preferably minimal) set $\mathcal{S}$ of relevant sets.

• The best applicable combination for $P$ is the most inclusive
combination $C \in \{\text{SCS}, \text{LCS}\}$ that allows us to evaluate $\sum_{i \in C \mid P(i)} a^{(C)}(i)$
using information that is available in the sketches of $\mathcal{S}$. Since we
get better estimates with the LCS, we should use the LCS when it
is applicable.

We can evaluate $P(i)$ for all $i \in$ SCS for general $P$. This is
because we have full membership information in the sets of $\mathcal{S}$ for all keys
in SCS($\mathcal{S}$). For $i \in$ LCS, we can determine membership of $i$ only
in sets $A \in \mathcal{S}$ such that $r(i) < r_{k+1}(A)$. Since the combination
must be applicable to all rank assignments, we can apply the LCS
only to predicates $P$ that have the form of an attribute-based
condition over keys in $\bigcup_{A \in \mathcal{S}} A$. As an example, the SCS is the best
applicable combination for the intersection of two sets $A \cap B$ (see footnote 2).

Note that once adjusted weights are computed, they can be
applied to multiple predicates that share the same relevant set $\mathcal{S}$ and
best applicable combination $C$.

Figure 3 illustrates the evaluation of an approximate weight for
some example predicates.
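
As an illustration of how one set of adjusted weights serves several predicates, the following Python sketch re-derives two of the Figure 3 estimates. It hard-codes the rounded adjusted weights listed in Figure 2 and the set memberships listed under "Sets" in Figure 1; the helper `estimate` and the predicate functions are hypothetical names. The full sets are used here only for brevity: for keys in the SCS, the same membership information can be read off the sketches.

```python
# Rounded adjusted weights for S = {A_1, ..., A_4} from Figure 2 (keys by index j).
scs = {1: 1.67, 2: 2.0, 3: 1.67, 4: 3.0, 7: 1.67, 10: 1.67}
lcs = {1: 1.37, 2: 2.0, 3: 1.37, 4: 3.0, 6: 1.37, 7: 1.37, 10: 1.37}

# Set memberships as listed under "Sets" in Figure 1.
sets = {
    "A1": {1, 3, 5, 7, 9},
    "A2": {1, 2, 5, 6, 9, 10},
    "A3": {3, 4, 5, 6, 7},
    "A4": {2, 4, 6, 8, 10},
}

def estimate(adjusted, predicate):
    """Sum the adjusted weights of sampled keys that satisfy the predicate."""
    return sum(a for key, a in adjusted.items() if predicate(key))

def in_at_least_two(key):
    # P_3 depends on set memberships, so the SCS is the best applicable combination.
    return sum(key in members for members in sets.values()) >= 2

def is_odd(key):
    # P_4 is attribute-based over the union, so the LCS applies.
    return key % 2 == 1

p3 = estimate(scs, in_at_least_two)
p4 = estimate(lcs, is_odd)
print(round(p3, 2), round(p4, 2))  # 11.68 4.11
```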

6. Poisson sampling 

We elaborate on the relation of bottom-k (order) sampling and
Poisson sampling. In particular, we discuss coordinated sketches
based on Poisson samples, estimators applicable to these sketches
and their relation to our estimators for order samples, and
computation issues.

Poisson sampling is a classic sampling method where each key
has an independent inclusion probability [33] which depends on
the weight of the key. Order (bottom-k) sampling [46, 47, 44, 45]
was initially developed as a twist on Poisson sampling
intended to achieve fixed-size samples. Literature in the computer
science field (re-)introduced order sampling as an alternative to
k-mins sampling and as a weighted reservoir sampling scheme [12].

Following [33, 47] but using our terminology, a Poisson sample
of a weighted set $(I, w)$ with respect to a family of distribution
functions $f_w$ ($w > 0$) and a value $\tau$ is obtained by drawing
a random rank assignment on $(I, w)$ and including all keys with
rank value $r(i) < \tau$. (Recall that an order sample with respect to
the same rank assignment and a fixed $k$ includes the $k$ keys with
smallest rank values.)

The probability that a key $i$ is included in the sample is $p(w(i), \tau)$
and inclusion probabilities of different keys are independent. (Recall
that inclusion probabilities are dependent with bottom-k sampling.)

There is a natural correspondence between Poisson and order
samples [44, 45, 47]: For a Poisson sample with a given $\tau$, the
corresponding order sample has size $k = \sum_{i} p(w(i), \tau)$, equal
to the expected sample size of the Poisson sample. This
correspondence facilitates the comparison of estimators over the two
sampling methods.

IPPS (inclusion probability proportional to size) sampling [33, 49]
includes each key with probability proportional to its weight.
These inclusion probabilities are known to minimize variance for
a given sample size. The order sampling equivalent of IPPS is
priority sampling (previously introduced as Sequential Poisson
Sampling) [33, 45, 17, 26]. Szegedy [53] recently established that the
estimator of [26] for a priority sample of size $k + 1$ has a sum of
per-key variances that is at most that of an IPPS Poisson sample
with $\tau$ that corresponds to $k$.

Footnote 2. We can still use the LCS indirectly to estimate $w(A \cap B)$ using the
inclusion-exclusion formula $w(A \cap B) = w(A) + w(B) - w(A \cup B)$.
But this estimator does not perform well (see Section 9).

Adjusted weights for Poisson sampling, $w(i)/p(w(i), \tau)$ for
a key $i$, are a straightforward application of the HT estimator. In
contrast, adjusted weights for order samples were only recently
derived [26, 19]. Szegedy's result [53] means that we can
simultaneously enjoy the fixed sample size of order sampling and (nearly)
optimal variance of IPPS Poisson sampling.

Poisson sampling can be performed on a data stream in a
scalable way only if IPPS sampling is used. Indeed, Poisson sampling
with respect to a fixed $\tau$ can be computed in a straightforward way
in a single pass that independently samples each key. Typically,
however, resource constraints limit sample size. For Poisson
sampling, this means that we want to set $\tau$ so that the expected sample
size is $k$, that is, $\tau$ that solves the equation $k = \sum_i p(w(i), \tau)$.
In a reservoir or data stream setting, we need to track the
solution of $k = \sum_i p(w(i), \tau)$ with respect to the prefix of the keys
seen so far, which is possible to do efficiently for IPPS sampling
(and uniform weights as a special case). Generally, fixed
expected-size sampling requires two passes over the data set.
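
For a data set that is held in memory, the threshold can be found by a simple search, since the expected sample size is nondecreasing in $\tau$. The following Python sketch assumes IPPS inclusion probabilities $p(w, \tau) = \min\{1, w\tau\}$; the function names `expected_size` and `solve_tau` are illustrative only, and bisection is one convenient choice rather than the method used in the cited work.

```python
def expected_size(weights, tau):
    """Expected Poisson sample size, assuming p(w, tau) = min(1, w * tau)."""
    return sum(min(1.0, w * tau) for w in weights)

def solve_tau(weights, k, iters=100):
    """Bisection for the tau with expected_size(weights, tau) == k."""
    if not 0 < k < len(weights):
        raise ValueError("k must lie strictly between 0 and the number of keys")
    lo, hi = 0.0, 1.0 / min(weights)  # at hi every key is included with probability 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_size(weights, mid) < k:
            lo = mid
        else:
            hi = mid
    return hi

# Weights of the ten keys in Figure 1 (total weight 13).
weights = [1, 2, 1, 3, 1, 1, 1, 1, 1, 1]
tau = solve_tau(weights, k=3)
print(round(tau, 4))  # 0.2308, that is, 3/13
```

Here no key reaches inclusion probability 1 at the solution, so the equation reduces to $13\tau = 3$.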
Coordinated Poisson samples and estimators. Coordinated Poisson
samples for $\mathcal{S}$ are such that there is a different (fixed) $\tau_A$ for
each set $A \in \mathcal{S}$ and the Poisson sample of $A$ includes all keys with
$r(i) < \tau_A$. We outline SCS-like and LCS-like adjusted weights
over coordinated Poisson samples. The expressions resemble, but
are simpler than, the ones we present for order samples (Section 4).
We highlight properties of these estimators. The analysis (just like
for a Poisson sample of a single set) is straightforward.

To express SCS-like adjusted weights for Poisson samples,
define $\tau_{\mathcal{S}} = \min_{A \in \mathcal{S}} \tau_A$. The set of keys in the union of the Poisson
samples of $A \in \mathcal{S}$ that have $r(i) < \tau_{\mathcal{S}}$ constitutes a Poisson sample
with $\tau_{\mathcal{S}}$ of the union of $\mathcal{S}$. For each key in this sample, we know
which sets $A \in \mathcal{S}$ it is a member of. We can therefore use the
adjusted weights $w(i)/p(w(i), \tau_{\mathcal{S}})$ for these keys and obtain an unbiased
estimator for any selection predicate. This derivation generalizes
the estimator used in [30, 31] for keys with uniform weights.

LCS-like adjusted weights are positive for all keys included in
the union of the Poisson samples of $\mathcal{S}$. The adjusted weight of a
key $i$ is computed as follows. Let $A$ be the set with largest $\tau_A$ such
that $i$ is included in the sample of $A$. The adjusted weight is then
$w(i)/p(w(i), \tau_A)$.
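
The two Poisson variants can be sketched in a few lines of Python. The toy data below (four unit-weight keys, two sets, and the thresholds) is made up for illustration, IPPS probabilities $p(w, \tau) = \min\{1, w\tau\}$ are assumed, and the function names are hypothetical.

```python
def incl_prob(w, tau):
    return min(1.0, w * tau)  # assumes IPPS / priority-style ranks

def poisson_sample(members, rank, tau):
    """Poisson sample of one set: all keys with rank below the set's threshold."""
    return {i for i in members if rank[i] < tau}

def scs_like(samples, taus, weight, rank):
    """Use the smallest threshold for every key that falls below it."""
    tau_s = min(taus.values())
    keys = {i for s in samples.values() for i in s if rank[i] < tau_s}
    return {i: weight[i] / incl_prob(weight[i], tau_s) for i in keys}

def lcs_like(samples, taus, weight):
    """Use, per key, the largest threshold among samples that contain it."""
    best = {}
    for name, sample in samples.items():
        for i in sample:
            best[i] = max(best.get(i, 0.0), taus[name])
    return {i: weight[i] / incl_prob(weight[i], t) for i, t in best.items()}

weight = {"a": 1, "b": 1, "c": 1, "d": 1}
rank = {"a": 0.1, "b": 0.3, "c": 0.5, "d": 0.7}   # one shared rank assignment
sets = {"A": {"a", "b", "c"}, "B": {"b", "c", "d"}}
taus = {"A": 0.4, "B": 0.6}

samples = {n: poisson_sample(m, rank, taus[n]) for n, m in sets.items()}
scs = scs_like(samples, taus, weight, rank)   # keys a, b with weight 1/0.4
lcs = lcs_like(samples, taus, weight)         # a uses 0.4; b and c use 0.6
```

In this toy instance key c is discarded by the SCS-like assignment because its rank exceeds the smaller threshold, while the LCS-like assignment retains it.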

Zero covariances of the SCS-like and LCS-like adjusted weights
of different keys are immediate from independence. Just like with
order sampling, LCS-like adjusted weights dominate SCS-like
adjusted weights (have at most the variance on all subpopulations)
but the LCS estimator is applicable only to selection predicates
that are attribute-based selections from the union of $\mathcal{S}$. Coordinated
Poisson and coordinated order samples can be compared if
we set the Poisson sampling $\tau$ value of each set to correspond to
an expected sample size of $k$.

Comparing estimators. Empirical evaluation of combination
estimators indicates similar performance. We suspect that Szegedy's [53]
result on the relation of priority (order) sampling and threshold
(Poisson) sampling generalizes to the respective SCS and LCS
variants.

Scalability of sampling. Recall that coordinated bottom-k
samples can be computed efficiently over explicit or implicit
representation of the data sets. Coordinated Poisson samples over
explicitly represented data sets can be computed efficiently in a single
pass for IPPS sampling or if fixed $\tau$ values are used for each set.
It seems that (even for IPPS or uniform weights), there may not
be a scalable method for computing Poisson samples with fixed
expected sample size or all-distances sketches over implicitly
represented sets. Intuitively, the difficulty is that respective exact $\tau$
values must be determined for all sets. With uniform weights,
exact $\tau$ values correspond to exact sizes of the sets. But it seems that
determining exact sizes (of say, all reachability sets in a graph) is
considerably harder than the respective estimation problem [12].

Our combination estimators for bottom-k samples offer a "best
of all worlds" solution: they match the performance benefits of
these Poisson multiple-set estimators and have the more desirable
framework of order sampling (fixed sample size and scalable
computation in more applications).

Input: set of coordinated bottom-k sketches $s_k(A, r)$ for sets $A \in \mathcal{A}$;
predicate $P$

• Analyze $P$ to determine:

o A (minimum) set $\mathcal{S}$ of "relevant sets."

o The best applicable combination $C \in \{\text{SCS}, \text{LCS}\}$:

If $P$ is an attribute-based condition over $\bigcup_{A \in \mathcal{S}} A$, $C \leftarrow$ LCS.

Else, $C \leftarrow$ SCS.

• Retrieve the sketches $s_k(A, r)$ of the sets $A \in \mathcal{S}$.

• Compute adjusted weights $a^{(C)}(i)$ for $i \in C_k(\mathcal{S}, r)$ using (1), if
$C$ = SCS, or (2), if $C$ = LCS.

• Output: $\sum_{i \in C \mid P(i)} a^{(C)}(i)$.



7. Unbiased selectivity estimators 

We estimate selectivities through adjusted selectivities $\rho(i)$ such
that $E[\rho(i)] = w(i)/w(U)$ (for all $i \in U$).

We consider three types of sketches $M \in \{\text{WSR}, \text{WSRD}, \text{WSRC}\}$
based on sampling with replacement from $U$. For an infinite
sequence $s$ of weighted samples with replacement from $U$, we
consider sampling with the following stopping rules: (i) WSR
(k-mins): after $k$ (not necessarily distinct) samples; (ii) WSRD: when
seeing the $(k+1)$st distinct key; (iii) WSRC: with respect to $\mathcal{S}$, when,
for at least one set $A \in \mathcal{S}$, we see the $(k+1)$st distinct key from
$A$.

The respective $M$ sketch is a set of keys and multiplicities
$c^{(M)}(i, s)$ ($i \in U$) (the number of times $i$ was sampled before
stopping). $c^{(M)}(U, s)$ denotes the sum of multiplicities of keys.



LEMMA 7.1. For $M \in \{\text{WSR}, \text{WSRD}, \text{WSRC}\}$, $\rho_1^{(M)}(i, s) =
c^{(M)}(i, s)/c^{(M)}(U, s)$ are correct adjusted selectivities.

PROOF. WSR: By definition, $c^{(M)}(U, s) = k$ and we obtain
the WSR k-mins estimator in [12, 7]. This well-known estimator,
used in [12, 7] to estimate the resemblance of $A_1$ and $A_2$ (the sum
of multiplicities of keys from $A_1 \cap A_2$ in the WSR k-mins sketch of
$A_1 \cup A_2$, divided by $k$), assigns to each key an adjusted selectivity
equal to its multiplicity in the sketch times $1/k$.

WSRD: Consider a key $i$. Partition the probability space so that
in each set of the partition the number of samples of keys from
$U \setminus \{i\}$ until we get $k$ distinct keys from $U \setminus \{i\}$ is fixed. We
will show that $\rho(i)$ is an unbiased selectivity in each subspace.
Consider a subspace where the number of samples of keys from
$U \setminus \{i\}$ until we get $k$ distinct keys from $U \setminus \{i\}$ is $\ell$. (Notice
that $\ell \ge k$.) The estimator $\rho(i)$ in this subspace is $\frac{c(i,s)}{c(i,s) + \ell - 1}$.
This is because if we do not sample $i$ by the time we get $k$ distinct
keys from $U \setminus \{i\}$ then $c(i, s) = 0$ as well as $\rho(i)$, and otherwise
$c(U, s) = c(i, s) + \ell - 1$ and therefore $\rho(i) = \frac{c(i,s)}{c(i,s) + \ell - 1}$.

The number of times $i$ is sampled between two samples from
$U \setminus \{i\}$ is geometrically distributed with parameter $p = w(i)/w(U)$.
Therefore we need to show that

\[ \sum_{i_1=0}^{\infty} \cdots \sum_{i_\ell=0}^{\infty} (1-p)^{\ell}\, p^{\sum_{j=1}^{\ell} i_j}\, \frac{\sum_{j=1}^{\ell} i_j}{\ell - 1 + \sum_{j=1}^{\ell} i_j} = p. \qquad (3) \]

By combining together terms in which $\sum_{j=1}^{\ell} i_j = t$ in the left
side of (3) we obtain that

\[ \sum_{t=1}^{\infty} \frac{t}{t + \ell - 1} \binom{t + \ell - 1}{t} (1-p)^{\ell} p^{t}
= (1-p)^{\ell}\, p \sum_{t=1}^{\infty} \binom{t + \ell - 2}{t - 1} p^{t-1}
= (1-p)^{\ell}\, p\, (1 + p + p^2 + \cdots)^{\ell}
= p. \]

WSRC: For subsets $A \in \mathcal{S}$ such that $i \in A$ consider the occurrence
of the $k$th distinct key from $A \setminus \{i\}$ and for subsets such that $i \notin A$
consider the occurrence of the $(k+1)$st distinct key from $A$. Fix
$\ell$ and consider the subspace of the probability space where the
total number of samples (of keys from $U \setminus \{i\}$) until and including the first among these
occurrences is $\ell$. If $i$ is sampled at least once, then there are $\ell - 1$
samples from $U \setminus \{i\}$ in the WSRC sketch. The number of times
$i$ is sampled between two samples from $U \setminus \{i\}$ is geometrically
distributed with parameter $w(i)/w(U)$. The proof proceeds as the
proof for WSRD sketches. □

LEMMA 7.2. For $M \in \{\text{WSR}, \text{WSRD}, \text{WSRC}\}$:

• For $i \ne j \in U$, $\mathrm{COV}[\rho_1^{(M)}(i, s), \rho_1^{(M)}(j, s)] \le 0$.

• $\sum_{i \in U} \rho_1^{(M)}(i, s) = 1$.

LEMMA 7.3. For $M \in \{\text{WSR}, \text{WSRD}, \text{WSRC}\}$ and $J \subset U$:

\[ \mathrm{VAR}[\rho_1^{(WSRC)}(J, s)] \le \mathrm{VAR}[\rho_1^{(WSRD)}(J, s)] \le \mathrm{VAR}[\rho_1^{(WSR)}(J, s)]. \]

7.1 Sampling without replacement



For a bottom-k sketch when all keys have equal weights, Broder
observed [6, 42] that the fraction of keys in the sketch of the union
$A_1 \cup A_2$ that are contained in $A_1 \cap A_2$ is an unbiased estimator
of the Jaccard coefficient. More generally, adjusted selectivities of
$\rho^{(WS)}(i) = 1/k$ are correct for WS sketches when the keys have
equal weights.

This is no longer true if keys have different weights, as the
following simple example shows. Consider two subsets $A_1 =
\{i_1, i_2, i_3\}$ and $A_2 = \{i_1, i_4\}$. The union contains four keys
$\{i_1, i_2, i_3, i_4\}$. Let the corresponding weights be $w(i_1) = 4$ and
$w(i_j) = 1$ for $j > 1$. The intersection of the two sets is $\{i_1\}$. The
resemblance is 4/7. Consider a bottom-2 WS sketch of $A_1 \cup A_2$.
The probability that $i_1$ appears (first or second) in the sketch is
4/7 + (3/7)(4/6) = 6/7; in that case, the respective fraction is
4/5 (since the other key in the sketch has weight 1). Otherwise, $i_1$
does not appear in the sketch and the fraction is zero. Therefore,
the expectation of the fraction is (6/7)(4/5) = 24/35 > 4/7. If we use
the fraction of coordinates of $A_1 \cap A_2$ in the sample instead of the
fraction of weights, we obtain (6/7)(1/2) = 3/7 < 4/7.
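
The arithmetic of this example can be checked by exact enumeration. The Python sketch below treats a bottom-2 WS sketch as two successive weighted draws without replacement, which is how the probability 4/7 + (3/7)(4/6) above is computed; the helper name `ws_prob` is illustrative.

```python
from fractions import Fraction
from itertools import permutations

weights = {"i1": 4, "i2": 1, "i3": 1, "i4": 1}
intersection = {"i1"}  # A_1 ∩ A_2

def ws_prob(order, weights):
    """Probability of drawing `order` by successive weighted sampling without replacement."""
    remaining = sum(weights.values())
    p = Fraction(1)
    for key in order:
        p *= Fraction(weights[key], remaining)
        remaining -= weights[key]
    return p

exp_weight_fraction = Fraction(0)  # expected weight fraction of the intersection in the sketch
exp_count_fraction = Fraction(0)   # expected fraction of sketch keys in the intersection
for order in permutations(weights, 2):
    p = ws_prob(order, weights)
    hits = [key for key in order if key in intersection]
    exp_weight_fraction += p * Fraction(sum(weights[k] for k in hits),
                                        sum(weights[k] for k in order))
    exp_count_fraction += p * Fraction(len(hits), 2)

print(exp_weight_fraction, exp_count_fraction)  # 24/35 3/7
```

Both values differ from the true resemblance 4/7, one from above and one from below.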

Unbiased selectivity estimators for WS bottom-k sketches can
be obtained via a mimicking process [17]. The mimicking process
is a randomized algorithm that takes a WS sketch as input and outputs a
sequence of "emulated" weighted samples with replacement. We
can also apply mimicking to an SCS of WS bottom-k sketches by
arranging the keys in the SCS by increasing rank values and using
this as an input to the process.

If we stop the process when $k$ keys (not necessarily distinct)
are drawn, we obtain a WSR sketch. If we continue until we see
the $(k+1)$st distinct key, which exhausts the "information" in the
WS sketch, we obtain a WSRD sketch. If applied to an SCS until
the information is exhausted, we obtain a WSRC sketch.

Mimicking allows us to carry over unbiased estimators
applicable to WSR, WSRD, and WSRC sketches to WS sketches and SCS
combinations.

Tighter estimators. The adjusted selectivities $\rho_1^{(M)}(i)$ ($i \in s$) have
the desirable qualities of (i) non-positive covariances between
different keys and (ii) adjusted selectivities sum up to one. (See [19],
[X3], [54] for a discussion of these qualities.)

We obtain tighter estimators than $\rho_1^{(M)}$ that share these
qualities but have a lower sum of per-key variances.

Mimicking is a random process and therefore, each WS sketch
or SCS corresponds to a probability distribution $D$ over WSR and
WSRD (for SCS also WSRC) sketches. Tighter estimators are
obtained by taking the expectation of $\rho_1^{(M)}$ (or average over multiple
draws) over $D$. We can get even tighter estimators by looking at
the expectation of this estimator over equivalence classes of WS
sketches (or SCS combinations). An equivalence class can include all
sketches/combinations with the same rank ordering of keys, obtained
by redrawing the ranks of keys, or (if total weight is available)
containing the same set of keys [17, 19]. One interesting corollary
is the following:

LEMMA 7.4. If all weights are equal, then $\rho^{(SCS)}(i) = 1/\ell$ for
all $i \in SCS_k(\mathcal{S}, r)$, where $\ell = |SCS_k(\mathcal{S}, r)|$, are correct adjusted
selectivities.

PROOF. Redrawing the rank values of the first $\ell$ keys in $U$ does
not change the SCS. The resulting distribution is symmetric for all
$\ell$ keys and therefore the expectation of $\rho_1(i)$ is the same. □

The adjusted selectivities $\rho^{(SCS)}(i) = 1/\ell$ are superior to $\rho^{(WS)}(i) =
1/k$ (and in particular, improve over the classic union-sketch
resemblance estimator [6, 42]). Both estimators have symmetric
non-positive covariances and the adjusted selectivities sum up to 1.
However, $\mathrm{VAR}[\rho^{(WS)}(i)] = N/k - 1 \ge \mathrm{VAR}[\rho^{(SCS)}(i)] = N/k' - 1$,
where $N$ is the total number of keys and $k' = 1/E[1/\ell]$, where
$\ell \ge k$ is the number of keys in the SCS. (Let $p_\ell$ be the
probability that the SCS contains $\ell$ keys. We have $\mathrm{VAR}[\rho^{(SCS)}(i)] =
\sum_\ell p_\ell (N/\ell - 1) = N/k' - 1$.) We typically have $k' \approx E[\ell]$.
Section 9 includes an evaluation of this estimator relative to the
classic union-sketch estimators for the Jaccard coefficient.

7.2 Sampling with replacement

We derive tighter estimators than $\rho_1^{(M)}$ for WSR, WSRD, and
WSRC sketches by considering the expectation of $\rho_1^{(M)}$ over
equivalence classes that correspond to a partition of the sample space of
sequences $s$. These estimators can be used with the mimicking
process.

For each $s$ and $M \in \{\text{WSR}, \text{WSRD}, \text{WSRC}\}$, let $s^{(M)} = \{i \mid c^{(M)}(i, s) \ge 1\}$
be the set of distinct keys in the corresponding sketch. For each
key $i \in s^{(M)}$, we know $w(i)$ (see footnote 3).

WSR: Each equivalence class contains all $s$ such that the
corresponding WSR sketches share the same set $s^{(WSR)}$ of distinct keys
(the $k$ first samples have the keys $s^{(WSR)}$). For $s$, let $L(s)$ be the
equivalence class containing $s$. (That is, $s' \in L(s)$ if and only
if $c^{(WSR)}(i, s') \ge 1$ for all $i \in s^{(WSR)}$ and $c(i, s') = 0$ for
$i \in U \setminus s^{(WSR)}$.) We denote by $t(i \mid s) = E_{s' \in L(s)}\, c^{(WSR)}(i \mid s')$
the expected number of times $i$ occurs in a WSR sketch from $L(s)$.
The adjusted selectivity estimator is

\[ \rho_2^{(WSR)}(i, s) = \frac{t(i \mid s)}{k}. \]

If $s^{(WSR)}$ contains the keys $i_1, \ldots, i_{k'}$ ($k' \le k$), then the
probability to obtain a particular WSR sketch with $m_j + 1$ samples from
$i_j$ and 0 samples from $U \setminus s^{(WSR)}$ is determined using the
multinomial distribution

\[ \binom{k}{m_1 + 1, \ldots, m_{k'} + 1} \prod_{h=1}^{k'} \left( \frac{w(i_h)}{w(U)} \right)^{m_h + 1}
= \frac{k!\, w(i_1) \cdots w(i_{k'})}{(k - k')!\, w(U)^k\, (1 + m_1) \cdots (1 + m_{k'})} \binom{k - k'}{m_1, \ldots, m_{k'}} \prod_{h=1}^{k'} w(i_h)^{m_h}. \]

The conditional expectation of the count $(m_j + 1)$ of key $i_j$ over
the subspace $L(s)$ is therefore

\[ t(i_j \mid s) = \frac{E_{M(k', k - k')}\left( \prod_{h \in [k'] \setminus \{j\}} \frac{1}{1 + m_h} \right)}{E_{M(k', k - k')}\left( \prod_{h \in [k']} \frac{1}{1 + m_h} \right)}, \]

where the expectation is over $M(k', k - k')$, the multinomial
distribution on $k'$ counts $m_1, \ldots, m_{k'}$ that sum to $k - k'$ and
probabilities $p_j = w(i_j)/w(s^{(WSR)})$.

WSRD: The equivalence classes are according to the set $i_1, \ldots, i_k$
of the $k$ keys included in the sketch and the sum $\sum_{j=1}^{k} c^{(WSRD)}(i_j, s)$
of their multiplicities. Denote $m_j = c^{(WSRD)}(i_j, s) - 1$ for
$j \in [k]$ and $\sum_{j=1}^{k} m_j = o$. The probability of obtaining a
particular such sketch is

\[ \frac{w(U) - w(s)}{w(U)} \binom{k + o}{m_1 + 1, \ldots, m_k + 1} \prod_{h=1}^{k} \left( \frac{w(i_h)}{w(U)} \right)^{m_h + 1}
= \frac{(w(U) - w(s))\, (k + o)!\, w(i_1) \cdots w(i_k)}{o!\, w(U)^{k + o + 1}\, (1 + m_1) \cdots (1 + m_k)} \binom{o}{m_1, \ldots, m_k} \prod_{h=1}^{k} w(i_h)^{m_h}. \]

The conditional expectation of $(m_j + 1)/(k + o)$ over all WSRD
sketches with set of keys $s$ and $o$ fixed is therefore

\[ \frac{1}{k + o} \cdot \frac{E_{M(k, o)}\left( \prod_{h \in [k] \setminus \{j\}} \frac{1}{1 + m_h} \right)}{E_{M(k, o)}\left( \prod_{h \in [k]} \frac{1}{1 + m_h} \right)}. \]

Footnote 3. The sketch can also include rank values that can be used to
estimate $w(U)$ [12]. The estimate of $w(U)$ is independent of the
selectivity estimators we derive. We can therefore obtain
subpopulation weight estimators by multiplying the selectivity estimators
with the estimate of $w(U)$. We do not need the rank values for
selectivity estimation per se.

8. SCS and LCS ML estimators



Our derivations build on the derivation of ML estimators for a
single WS sketch of a weighted set [19].

We use the following lemma, which is a consequence of the
memoryless nature of the exponential distribution.

LEMMA 8.1. [17] Consider a probability subspace of rank
assignments over $J$ where the $k$ keys of smallest ranks are $i_1, \ldots, i_k$
in increasing rank order. The rank differences $r_1(J), r_2(J) -
r_1(J), \ldots, r_{k+1}(J) - r_k(J)$ are independent random variables,
where $r_j(J) - r_{j-1}(J)$ ($j = 1, \ldots, k+1$) is exponentially
distributed with parameter $w(J) - \sum_{\ell=1}^{j-1} w(i_\ell)$. (We formally define
$r_0(J) = 0$.)

SCS ML estimator for $w(U)$ ($U = \bigcup_{A \in \mathcal{S}} A$). Let $u_1, u_2, \ldots, u_\ell$
be the keys in $SCS_k(\mathcal{S}, r)$, sorted by increasing rank values. Let
$s_i = \sum_{h=1}^{i} w(u_h)$ ($s_0 = 0$).

LEMMA 8.2. The ML estimator for $w(U)$ is the solution of the
equation $\sum_{i=0}^{\ell} \frac{1}{x - s_i} = r_{k+1}(\mathcal{S})$.

PROOF. Let $r_i = r(u_i)$ ($r_0 = 0$) for $i \le \ell$ and let $r_{\ell+1} =
r_{k+1}(\mathcal{S})$.

Consider an equivalence class of rank assignments such that the
rank order of all the keys in $U$ is as in $r$. The probability density of
a rank assignment $r'$ in this class obtaining the rank values $r'_j = r_j$
for $u_j$ ($j \in [\ell]$) is

\[ \prod_{i=0}^{\ell} (x - s_i) \exp(-(x - s_i)(r_{i+1} - r_i)), \qquad (4) \]

where $x = w(U)$. (First note that the fixed ordering of rank
values in $U$ also determines the SCS. From Lemma 8.1 the
differences $r'_{i+1} - r'_i$ ($i \ge 0$) are independent exponentially distributed
random variables with parameter $w(U) - s_i$.)

By taking the natural logarithm of (4) and differentiating with respect
to $x$, we obtain $\sum_{i=0}^{\ell} \frac{1}{x - s_i} - \sum_{i=0}^{\ell} (r_{i+1} - r_i) = \sum_{i=0}^{\ell} \frac{1}{x - s_i} - r_{k+1}(\mathcal{S})$.
Setting this to zero gives the estimator as the value of $x$ that maximizes
this probability density. □
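
The equation of Lemma 8.2 has no closed form in general, but its left-hand side is decreasing in $x$ for $x > s_\ell$, so a one-dimensional search suffices. The following Python sketch uses bisection, which is one possible choice and not something the text prescribes; the function name `ml_total_weight` and its arguments are hypothetical. It takes the weights of the SCS keys in increasing rank order and the value $r_{k+1}(\mathcal{S})$.

```python
def ml_total_weight(key_weights, tau, iters=200):
    """Solve sum_{i=0..l} 1 / (x - s_i) = tau for x > s_l by bisection.

    key_weights: weights of the SCS keys, in increasing rank order.
    tau: the value r_{k+1}(S).
    """
    prefix = [0.0]                      # s_0, s_1, ..., s_l
    for w in key_weights:
        prefix.append(prefix[-1] + w)

    def lhs(x):
        return sum(1.0 / (x - s) for s in prefix)

    lo = prefix[-1]                     # lhs tends to +infinity as x approaches s_l
    hi = lo + 1.0
    while lhs(hi) > tau:                # widen until the root is bracketed
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:      # floating-point resolution reached
            break
        if lhs(mid) > tau:
            lo = mid
        else:
            hi = mid
    return hi

# One sampled key of weight 1 and r_{k+1}(S) = 1.5: 1/x + 1/(x - 1) = 1.5 at x = 2.
x = ml_total_weight([1.0], 1.5)
print(round(x, 6))  # 2.0
```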

SCS ML subpopulation estimator. Consider a subpopulation $J \subset
U$. Let $i_1, \ldots, i_m$ be the keys in $SCS_k(\mathcal{S}, r) \cap J$, in increasing rank
order. Let $s_j = \sum_{h=1}^{j} w(i_h)$ ($s_0 = 0$).

LEMMA 8.3. The solution of

\[ \sum_{h=0}^{m-1} \frac{1}{x - s_h} = r_{k+1}(\mathcal{S}) \]

is a maximum likelihood estimator for $w(J)$.

PROOF. Let $r_j = r(i_j)$ ($r_0 = 0$).

Consider an equivalence class $R(r)$ of rank assignments such
that the rank values of keys $i \in U \setminus J$ are $r'(i) = r(i)$, and the rank
order induced by $r'$ on the keys of $J$ is as in $r$.

The joint probability density function in $R$ for the bottom $m$
ranks in $J$ being $r_1 < \cdots < r_m$ and the $(m+1)$st smallest rank
being at least $\tau = r_{k+1}(\mathcal{S})$, as a function of $x = w(J)$, is

\[ \exp(-(x - s_m)(\tau - r_m)) \prod_{h=0}^{m-1} (x - s_h) \exp(-(x - s_h)(r_{h+1} - r_h)). \]

By taking the natural logarithm and differentiating, we obtain the
estimator as the value of $x$ that maximizes this probability
density. □

ML Estimators that use the weights of the sets. 

For data sources with explicit representation of sets, the summa- 
rization algorithm can provide the total weight of (the keys in) 
each set without a significant processing or communication over- 
head 1 6, 42 17 1). We derive tighter estimators that use the weight 
of sets. 

scs estimator for the weight of the intersection and union of 
two sets 



Let S = {A, B}. Let ii, . . . , i m be the keys in scSfe(5, r) n 
(AnB), in order of increasing rank values. Let . . . , i' m , be the 
keys in SCSfe(5, r)Pi(AUB\APiB), in order of increasing ranks. 
Let s 3 = J2h=i w (ih) (so = 0) and Let s'j = J2h=i w (. i 'h) 
(s' = 0). 



LEMMA 8.4. The solution of 

m—l m — 1 



1 



w(A) + w(B) -2x- s\ 



is an ML estimator for w(A n B). 

PROOF. Consider the equivalence class $R(r)$ of rank assignments $r'$
where (1) the $m$ keys of smallest $r'$ ranks from $A \cap B$ and the
$m'$ keys of smallest $r'$ ranks from $(A \cup B) \setminus (A \cap B)$
are the same sets, in the same rank order, as for $r$, and (2)
$r'_{m+1}(A \cap B) > r_{k+1}(\mathcal{S})$ and
$r'_{m'+1}((A \cup B) \setminus (A \cap B)) > r_{k+1}(\mathcal{S})$.

We compute the probability density function of the event that
$r'(i) = r(i)$ for the $m + m'$ keys in $\mathrm{SCS}_k(\mathcal{S}, r)$,
for $r' \in R(r)$, as a function of $x = w(A \cap B)$.

Rank values in the two disjoint sets $A \cap B$ and
$(A \cup B) \setminus (A \cap B)$ are independent. The rank differences
within each set are also independent.

Observe that $w((A \cup B) \setminus (A \cap B)) = w(A) + w(B) - 2x$.
We take the natural logarithm and differentiate with respect to $x$ to
find the value of $x$ that maximizes the probability density. □

The equation can be solved by a search on the interval
$[s_{m-1},\, w(A) + w(B) - s'_{m'-1}]$, as the left-hand side is a
monotone function of $x$.

Observe that if $x$ is the ML estimator for $w(A \cap B)$, then the
ML estimator for the resemblance is $x/(w(A) + w(B) - x)$ and
the ML estimator for the union is $w(A) + w(B) - x$.
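
The search can be sketched in Python as follows. This is an
illustration under stated assumptions, not the authors'
implementation: the equation solved is the one obtained by
differentiating the log-density described in the proof, namely
$\sum_{h=0}^{m-1} 1/(x - s_h) - 2\sum_{h=0}^{m'-1} 1/(w(A) + w(B) - 2x - s'_h) + r_{k+1}(\mathcal{S}) = 0$;
the function name and argument layout are hypothetical; both $m$ and
$m'$ are assumed to be at least 1; and the bisection is restricted to
the sub-interval on which every denominator is positive.

```python
from itertools import accumulate


def ml_intersection_weight(w_a, w_b, inter_weights, diff_weights, r):
    """Return ML-style estimates (intersection, union, resemblance).

    inter_weights: weights of sampled keys in A∩B, by increasing rank.
    diff_weights:  weights of sampled keys in (A∪B) minus (A∩B), same order.
    r:             the threshold rank r_{k+1}(S); must be positive.
    """
    if not inter_weights or not diff_weights or r <= 0:
        raise ValueError("need sampled keys in both parts and r > 0")
    total = w_a + w_b
    s = [0.0] + list(accumulate(inter_weights))[:-1]   # s_0 .. s_{m-1}
    sp = [0.0] + list(accumulate(diff_weights))[:-1]   # s'_0 .. s'_{m'-1}

    def g(x):
        y = total - 2.0 * x  # weight of the symmetric difference
        return (sum(1.0 / (x - v) for v in s)
                - 2.0 * sum(1.0 / (y - v) for v in sp) + r)

    lo = s[-1]                    # g -> +infinity just above s_{m-1}
    hi = (total - sp[-1]) / 2.0   # g -> -infinity just below this point
    if not lo < hi:
        raise ValueError("inconsistent weights: empty search interval")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mid <= lo or mid >= hi:
            break
        if g(mid) > 0:
            lo = mid  # g is decreasing in x
        else:
            hi = mid
    x = (lo + hi) / 2.0
    return x, total - x, x / (total - x)
```

The second and third return values apply the two identities stated
above for the union and the resemblance.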

LCS estimator for the weight of set union

Consider a rank assignment $r$. Let $A_{i_1}, A_{i_2}, \ldots, A_{i_s}$
be the sets in $\mathcal{S}$, sorted according to $r_{k+1}(A)$. We
derive an ML estimator for $w(\bigcup_{j \in [s]} A_{i_j})$ that uses
$w(A_{i_j})$ ($j \in [s]$).

For any key $x$ in the sketch of $A_{i_j}$ we know whether
$x \in A_{i_h}$ for all $h > j$. So we can apply the subpopulation ML
estimator with a known total weight of [19] to the sketch of $A_{i_j}$,
and get estimates for the weights of
$H_j = A_{i_j} \setminus \bigcup_{h>j} A_{i_h}$ and
$H'_j = A_{i_j} \cap (\bigcup_{h>j} A_{i_h})$. We have that
$w(H_j) + w(H'_j) = w(A_{i_j})$. The weight of the union is
$\sum_{h=1}^{s} w(H_h)$, and we obtain an estimate for the weight of the
union by summing the corresponding estimates of $w(H_h)$. (Note that
$H_s = A_{i_s}$ and therefore $w(H_s)$ is known exactly.)
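
The decomposition of the union into the disjoint parts $H_j$ can be
checked on explicit sets. The Python sketch below works with full sets
rather than sketches, purely to illustrate the identity
$w(\bigcup_j A_{i_j}) = \sum_j w(H_j)$ for uniform weights; in the
estimator itself each $w(H_j)$ would be estimated from the sketch of
$A_{i_j}$. The function name is hypothetical.

```python
def union_decomposition(sets_in_order):
    """Return [H_1, ..., H_s], where H_j is A_{i_j} minus every later set."""
    parts = []
    later = set()  # union of the sets A_{i_h} with h > j
    # Walk backwards so that `later` is complete when H_j is formed.
    for a in reversed(sets_in_order):
        parts.append(set(a) - later)
        later |= set(a)
    parts.reverse()
    return parts


sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
parts = union_decomposition(sets)
# The parts are pairwise disjoint and their sizes add up to the union size.
union_size = len(set().union(*sets))
```

Here the last part equals the last set, matching the remark that
$w(H_s)$ is known exactly.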

We can apply the same methodology to estimate the weight of
a subpopulation $J \subset \bigcup_{j \in [s]} A_{i_j}$ (specified by a
predicate with attribute-based conditions), by estimating
$w(H_j \cap J)$, using the property that
$w(H_j \cap J) + w(H'_j \cup (H_j \setminus J)) = w(A_{i_j})$.

For two sets $A_1$ and $A_2$ we also obtain an ML LCS estimator for
$w(A_1 \cap A_2)$ using $w(A_1) + w(A_2) - \hat{w}(A_1 \cup A_2)$,
where $\hat{w}(A_1 \cup A_2)$ is the estimate of the union. This
estimate is always nonnegative, since
$\hat{w}(A_1 \cup A_2) \le \sum_{h=1}^{2} w(A_h)$.

9. Empirical Evaluation 

We compare our combination estimators to state-of-the-art
estimators applied to the union sketch. As a point of reference, we
also include k-mins estimators applied to the sketch of the union
obtained from k-mins sketches. We measure the benefit of combination
estimators by their improvement factor, which is the ratio of the
average relative error of (the best) union-sketch estimator to that of
the combination estimator.
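
As a concrete reading of this metric, the following Python sketch
computes an improvement factor from repeated estimates. It assumes
that the relative error of a single run is $|\hat{w} - w| / w$ and
that errors are averaged over runs; the function names and the sample
numbers are hypothetical.

```python
def average_relative_error(estimates, truth):
    """Mean of |estimate - truth| / truth over all runs."""
    return sum(abs(e - truth) for e in estimates) / (len(estimates) * truth)


def improvement_factor(union_estimates, combination_estimates, truth):
    """Ratio of the union-sketch error to the combination error."""
    return (average_relative_error(union_estimates, truth)
            / average_relative_error(combination_estimates, truth))


# Hypothetical estimates of a true value of 100 over four runs.
factor = improvement_factor([90, 110, 120, 80], [95, 105, 110, 90], 100)
```

In this made-up example the union-sketch estimates have an average
relative error of 0.15 and the combination estimates 0.075, so the
improvement factor is 2.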



Datasets. We used synthetic data designed to quantify and demonstrate
how the quality and relative performance of the estimators depend on
different parameters of the data, such as the number of relevant sets
in the selection predicate and the relation between these sets. We
also used the following real-life datasets that demonstrate example
applications:

• Two IP packet traces of about $9 \times 10^6$ packets from gateway
routers (peering and campus). These traces were partitioned into 5
consecutive time periods, and we produced coordinated sketches of
the set of destination IP addresses in each time period. The campus
data had 3196, 2636, 2656, 2175, and 2105 distinct addresses in the
respective time periods and 6830 distinct addresses overall. The
peering data had 14158, 14564, 14281, 14705, and 14483 distinct
addresses in the respective time periods and 37574 distinct IP
addresses overall.

• The Netflix Prize [43] data, which consists of about $1 \times 10^8$
reviews by $5 \times 10^5$ users of 17770 movies. We consider the set
of reviewers of each movie as a "set," and produced coordinated
sketches for these sets.

Figure 4. Top: Averaged relative error of different estimators
for the Jaccard coefficient of two sets, each containing 10000
(uniformly weighted) keys. The size of the intersection is 2000 (left)
and 200 (right). Bottom: Left: Combination sizes for 2 sets
containing 10000 keys. Right: Ratio of averaged relative error of SCS
to union-sketch estimators of Jaccard coefficient (with the
corresponding square root of the ratio of k to the SCS size).



Predicates with 2 relevant sets. We first consider basic pairwise
aggregates: the union size, intersection size, Jaccard coefficient,
and Hamming distance (the difference of the sizes of the union and
the intersection).

We use two sets $A_1$ and $A_2$ of the same size
$|A_1| = |A_2| = 10{,}000$ and a varying number of common keys
$|A_1 \cap A_2| \in \{200, 2000, 9000\}$ (respective Jaccard
coefficients of about 0.01, 0.11, and 0.82). We applied the RC union,
k-mins union, and our RC SCS and RC LCS estimators for the size of the
union. We applied the RC union, k-mins union, and our RC SCS
estimators for the size of the intersection. The intersection
estimator based on inclusion-exclusion and the RC LCS estimate of the
union, $w(A_1) + w(A_2) - \hat{w}(A_1 \cup A_2)$, was also evaluated,
but it performed considerably worse than the other estimators and is
not shown. Hamming distance is estimated as the difference of the
union and intersection estimators (as the difference of unbiased
estimators, this estimator is unbiased; it is also easy to show from
the derivation that the estimate is always nonnegative). For the
Jaccard coefficient, we applied the classic k-mins and bottom-k union
estimators of Broder [7, 6] and our SCS combination selectivity
estimator (Section 7). Figure 4 shows the average relative error, over
1000 runs, of the Jaccard coefficient estimators. For uniform weights
and for $k$ small relative to the number of keys, the relative error
of the union-sketch estimators decreases in proportion to $\sqrt{k}$
(it scales as $1/\sqrt{k}$), and there was a proportional decrease
also for the combination estimators.
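
The exact values of the four pairwise aggregates for this setup follow
directly from the set sizes, using
$|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|$. The short Python
sketch below tabulates them; the function name is hypothetical.

```python
def pairwise_aggregates(size_a, size_b, common):
    """Exact union, intersection, Jaccard and Hamming values from sizes."""
    union = size_a + size_b - common
    return {
        "union": union,
        "intersection": common,
        "jaccard": common / union,
        "hamming": union - common,  # keys in exactly one of the two sets
    }


table = {c: pairwise_aggregates(10000, 10000, c) for c in (200, 2000, 9000)}
```

For 200, 2000, and 9000 common keys this gives Jaccard coefficients of
about 0.0101, 0.1111, and 0.8182, and Hamming distances of 19600,
16000, and 2000.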

The improvement factor of combination estimators is larger
when the Jaccard coefficient is smaller. The intuitive reason is that
a smaller Jaccard coefficient means less overlap between the sets,
hence less overlap between sketches, and more distinct keys in the
combination that are available to the combination estimators. We
relate the improvement factor to the size of the combination.
Figure 4 (bottom, left) shows the ratio $\ell/k$, where $\ell$ is the
average size of the combination (SCS and LCS) and $k$ is the size of
the union-sketch. The figure demonstrates that the combination size
is larger when the Jaccard coefficient is smaller. Figure 4 (bottom,
right) shows the ratio of the relative error of our SCS Jaccard
coefficient estimator to that of the union-sketch estimators (the
inverse of the improvement factor) and the respective
$\sqrt{k/\ell}$. In agreement with an analytic approximation
(Section 7), we can see that the improvement factor is approximated
well by $\sqrt{\ell/k}$, where $\ell$ is the combination size.
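
The relation between overlap and combination size can be reproduced
with a small simulation. The Python sketch below is an illustration
under several assumptions: uniform weights (so ranks are plain uniform
random numbers), the SCS taken as the keys in the union of the
sketches whose rank is below $\min_A r_{k+1}(A)$, and the LCS taken as
all keys in the union of the sketches. All function names are
hypothetical.

```python
import random


def bottom_k(keys, rank, k):
    """Return the k lowest-ranked keys of a set and its (k+1)st smallest rank."""
    ordered = sorted(keys, key=rank.__getitem__)
    return set(ordered[:k]), rank[ordered[k]]


def build_combinations(sets, k, seed=0):
    """Build the union-sketch, SCS and LCS of coordinated bottom-k sketches."""
    rng = random.Random(seed)
    # Coordination: a key receives one rank, shared by all sets containing it.
    rank = {key: rng.random() for key in sorted(set().union(*sets))}
    sketches, thresholds = zip(*(bottom_k(a, rank, k) for a in sets))
    lcs = set().union(*sketches)
    r_min = min(thresholds)  # r_{k+1}(S), the smallest per-set threshold
    scs = {key for key in lcs if rank[key] < r_min}
    union_sketch = set(sorted(lcs, key=rank.__getitem__)[:k])
    return union_sketch, scs, lcs


def size_ratios(common, size=10000, k=100):
    """Ratios |SCS|/k and |LCS|/k for two sets sharing `common` keys."""
    a1 = set(range(size))
    a2 = set(range(size - common, 2 * size - common))
    _, scs, lcs = build_combinations([a1, a2], k)
    return len(scs) / k, len(lcs) / k
```

Calling `size_ratios` for 200, 2000, and 9000 common keys gives a quick
way to compare the ratios $\ell/k$, and their square roots, across
different Jaccard coefficients.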

In particular, our combination estimator for the Jaccard coefficient
has about half the variance of union-sketch estimators [7, 6] when
the two sets are almost disjoint. For the application of identifying
all similar pairs [7, 34], and on typical corpora, with only
a small fraction of pairs being similar, our estimator significantly
decreases "false positives."

Performance dependence on the number of relevant sets. We
next consider a synthetic distribution where each set has 5000 keys,
1000 of which are common to all sets and 4000 of which are unique to
the set. This collection of sets allows us to study how the benefit of
combination estimators increases with the number of sets. Figure 6
(top) shows the average relative error for estimating the size of the
union of multiple (2, 3, 4, and 5) sets using the RC union, RC SCS,
and RC LCS estimators. The average relative error of the union-size
estimator applied to the sketch of the union is about
$\sqrt{2/(\pi k)}$, and it is about $\sqrt{2/(\pi \ell)}$ for the
combination estimators. Figure 6 (bottom) shows the ratio of the
combination size to $k$. A simple calculation shows that the LCS size
with $i$ sets is about $\ell = 0.2k + 0.8ik$. The SCS size ratio
varies with $k$ and approaches the LCS size ratio as $k$ increases.
Figure 6 also demonstrates that improvement factors are approximated
well by $\sqrt{\ell/k}$, where $\ell$ is the size of the combination.
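
These approximations are easy to tabulate. The Python sketch below
evaluates the stated error formula $\sqrt{2/(\pi n)}$ for a sample of
size $n$ and the stated LCS size $\ell = 0.2k + 0.8ik$; the function
names are hypothetical, and the numbers are predictions from the
formulas, not measured results.

```python
import math


def predicted_relative_error(sample_size):
    """Approximate average relative error sqrt(2 / (pi * n))."""
    return math.sqrt(2.0 / (math.pi * sample_size))


def lcs_size_synthetic(k, num_sets):
    """Approximate LCS size 0.2k + 0.8ik for the synthetic collection."""
    return 0.2 * k + 0.8 * num_sets * k


def predicted_improvement(k, num_sets):
    """Ratio of predicted union-sketch error to predicted LCS error."""
    ell = lcs_size_synthetic(k, num_sets)
    return predicted_relative_error(k) / predicted_relative_error(ell)


rows = [(i, lcs_size_synthetic(100, i), predicted_improvement(100, i))
        for i in (2, 3, 4, 5)]
```

For $k = 100$ and $i = 5$ the formulas give $\ell = 420$ and a
predicted improvement factor of $\sqrt{4.2} \approx 2.05$.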

Figure 5 shows the improvement factor of the SCS RC and LCS
RC estimators on the destination IP addresses data sets. We estimate
the total number of distinct destination IP addresses (union)
and the number of common destination IP addresses (intersection)
of the first $i \in \{2, 3, 4, 5\}$ time periods. The figure shows how
the improvement factor of the SCS and (in particular) the LCS
increases with the number of sets. The improvement factor is again
approximated well by $\sqrt{\ell/k}$ (not shown).

Performance dependence on the relation between the sets. When
sets have fewer common keys, combinations contain more keys,
and combination estimators have larger improvement factors. We
demonstrate this using two collections $\mathcal{S}_1$ and
$\mathcal{S}_2$ of 5 sets each. Both collections have the same size
union (49530 keys). $\mathcal{S}_1$ contains 5 disjoint sets of 9906
keys. $\mathcal{S}_2$ contains sets of size 29718 with 24765 keys
common to all sets and 4953 exclusive keys for each set. The LCS of
$\mathcal{S}_1$ contains about $5k$ keys. The LCS of $\mathcal{S}_2$
contains about $5k/3$ keys (5/6 of the keys in each sketch are common
to all 5 sets). Figure 7 shows corresponding improvement factors of
$\sqrt{5}$ for $\mathcal{S}_1$ and $\sqrt{5/3}$ for $\mathcal{S}_2$.
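
The arithmetic behind these two collections can be written out
explicitly. The Python sketch below assumes that, in expectation, the
fraction of common keys in each sketch equals the fraction of common
keys in each set, and that coordinated sketches share the same sampled
common keys; the function name is hypothetical.

```python
import math


def collection_summary(num_sets, common, exclusive, k):
    """Set size, union size, approximate LCS size, and sqrt(LCS size / k)."""
    set_size = common + exclusive
    union_size = common + num_sets * exclusive
    common_fraction = common / set_size
    # Common keys are counted once; exclusive keys once per set.
    lcs = k * (common_fraction + num_sets * (1.0 - common_fraction))
    return set_size, union_size, lcs, math.sqrt(lcs / k)


s1 = collection_summary(5, 0, 9906, k=100)      # five disjoint sets
s2 = collection_summary(5, 24765, 4953, k=100)  # heavily overlapping sets
```

Both calls give a union of 49530 keys; the approximate LCS sizes are
$5k$ and $5k/3$, matching the factors $\sqrt{5}$ and $\sqrt{5/3}$
quoted above.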

SCS versus LCS. Figures 5, 6, and 7 show comparable performance
factors (reflecting similar sizes) for the SCS and the LCS. When can
we expect the SCS to be large? For $A \in \mathcal{S}$, keys in the
sketch of $A$ are included in the SCS only if they have rank smaller
than $r_{k+1}(\mathcal{S})$. Thus, when
$r_{k+1}(\mathcal{S}) = \min_{B \in \mathcal{S}} r_{k+1}(B)$ is close
to $r_{k+1}(A)$, most keys are included in the SCS. The SCS is large
when sets have closely related distributions of $r_{k+1}(A)$ (sets
have similar weights) and when $|\mathcal{S}|$ is smaller (see
Figure 6). If there is high heterogeneity in the weights of the sets
in $\mathcal{S}$, the distribution of $r_{k+1}(\mathcal{S})$ is
dominated by that of $r_{k+1}(A)$, where $A$ is the heaviest set in
$\mathcal{S}$; a set $B \in \mathcal{S}$ will have only about
$k\,w(B)/w(A)$ of the keys in the sketch of $B$ included in the SCS.
Even with homogeneous sets the SCS is smaller when $|\mathcal{S}|$ is
larger (see Figure 6).
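
The $k\,w(B)/w(A)$ rule of thumb can be checked by simulation. The
Python sketch below assumes two disjoint sets with uniform weights and
uniform random ranks, and counts how many keys from the sketch of the
lighter set fall below the threshold $r_{k+1}(\mathcal{S})$; the
function name and parameter values are hypothetical.

```python
import heapq
import random


def scs_share_of_lighter_set(size_a, size_b, k, runs=100, seed=1):
    """Average number of keys from the sketch of B that enter the SCS."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        ranks_a = [rng.random() for _ in range(size_a)]
        ranks_b = [rng.random() for _ in range(size_b)]
        low_a = heapq.nsmallest(k + 1, ranks_a)  # sorted ascending
        low_b = heapq.nsmallest(k + 1, ranks_b)
        threshold = min(low_a[k], low_b[k])      # r_{k+1}(S)
        # Only the k sketch keys of B with rank below the threshold count.
        total += sum(1 for r in low_b[:k] if r < threshold)
    return total / runs


average = scs_share_of_lighter_set(10000, 1000, k=100)
```

With $w(A) = 10000$, $w(B) = 1000$, and $k = 100$, the rule of thumb
predicts about 10 of the 100 keys in the sketch of $B$.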

Figure 8 shows the performance of estimators for two queries
on the Netflix data set: "the number of users with at least one
rating of a National Geographic title" and "the number of users with
at least one rating of a movie released on or before 1930." These
are estimates of the size of a union of sets. The corresponding
collections of sets were larger (more sets than in previous
datasets) and heterogeneous (high variability in the number of
reviewers of different titles). For the first query, there were 45
National Geographic titles with 19708 ratings by 12351 distinct
reviewers. The number of ratings for each NG title varied between 93
and 1170 (mean is about 438). For the second query, there were 120
titles with release year on or before 1930. There were 117617
ratings with 53774 distinct reviewers. The number of ratings per
title ranged between 54 and 12054 (mean is 980). We observe an
improvement factor of 3-4 of the RC LCS estimator over RC union,
but we also see a ratio of 1.5-2 between the relative errors of the
RC SCS and the RC LCS estimates, reflecting a much smaller SCS
than LCS.

Lastly, we consider the incremental effectiveness of combination
samples. SCS samples (that are not included in the union-sketch)
are always as effective as additional samples from the union
of the sets. LCS samples (that are not included in the SCS) can be
as effective, but the effectiveness decreases with the heterogeneity
of $\mathcal{S}$. Intuitively, consider two sets, one much larger than
the other, each contributing $k$ samples. Then samples from the
smaller set (which are mostly excluded from the SCS) are much less
useful for estimating properties of the union of the sets. On the
other hand, if we have multiple homogeneous sets, the SCS is smaller
than the LCS due to the "variance" of the $(k+1)$st rank, but LCS
samples are as effective.

k-mins versus bottom-k. "Without replacement" (bottom-k)
estimators dominate "with replacement" (k-mins) estimators, but the
gain is negligible with uniform weights (see Figure 4). The gain can
be significant only when keys are likely to be sampled repeatedly
under "with replacement" sampling [26, 17]. With uniform weights,
union-sketch k-mins estimators perform similarly to the respective
union-sketch bottom-k estimators, but combination estimators
typically outperform union-sketch estimators. Since combination
estimators are not applicable to k-mins sketches, this suggests the
use of bottom-k sketches also with uniform weights.
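
The difference between the two sampling schemes can be illustrated
with a minimal Python sketch. It assumes uniform weights, so that
ranks are plain uniform random numbers: a bottom-k sample takes the
$k$ lowest-ranked keys under one rank assignment, while a k-mins
sample takes the minimum-rank key under each of $k$ independent rank
assignments and may therefore repeat keys. The function names are
hypothetical.

```python
import random


def bottom_k_sample(keys, k, seed=0):
    """k distinct keys: the k smallest ranks under one rank assignment."""
    rng = random.Random(seed)
    rank = {key: rng.random() for key in keys}
    return sorted(keys, key=rank.__getitem__)[:k]


def k_mins_sample(keys, k, seed=0):
    """k keys with possible repeats: one minimum per rank assignment."""
    rng = random.Random(seed)
    sample = []
    for _ in range(k):
        rank = {key: rng.random() for key in keys}
        sample.append(min(keys, key=rank.__getitem__))
    return sample


keys = list(range(50))
without_replacement = bottom_k_sample(keys, 20)
with_replacement = k_mins_sample(keys, 20)
```

The bottom-k sample always holds 20 distinct keys, whereas the k-mins
sample holds 20 draws that may include repeats; repeats become rare
when the set is much larger than $k$, which is consistent with the
negligible gain reported for uniform weights.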

Weighted keys. The improvement factor, as a function of $\ell/k$, is
larger when keys are weighted. This is because the variance decrease
with sample size is at least $1/k$ (the relative error decrease is at
least $1/\sqrt{k}$), with uniform weights exhibiting the "worst-case"
decrease.

Restricted predicates. The demonstrated performance factor on
unions and intersections of sets carries over when adding
attribute-based conditions to the predicate. This is because, also
with added conditions, the combination contains proportionally more
keys than the union sketch. Examples of attribute-based conditions
are (on IP addresses) restricting the query to blacklisted addresses
or to addresses that belong to a particular Autonomous System, or (on
Netflix-like data) to reviewers of a certain gender or zip code.

Figure 7. Relative error of RC union, RC SCS, and RC LCS
estimators on the size of the union of 5 sets. Size of the union is
49530. Left: 5 disjoint sets of size 9906. Right: 5 sets with 24765
common keys to all 5 and 4953 exclusive keys in each set.

Figure 8. Relative error of RC union, RC SCS, and RC LCS
estimators on "number of distinct reviewers of National Geographic
titles" and "number of distinct reviewers of titles released on or
before 1930." (Netflix data set)



10. Related work 

Sample-based coordinated sketches. Coordinated samples of
multiple sets, based on keys "retaining" the same "random draw"
across sets, are extensively used as a way to maximize or minimize
sample overlap [5, 44, 45, 47, 48] or to facilitate (approximate)
aggregations over distinct keys [30] [31] [3 ESI [TT] EH) .
Sample-based coordinated sketches were used with size-k samples with
replacement (k-mins sketches) [12, 7, 16], size-k samples without
replacement (bottom-k/order samples) [44, 45, 47, 12, 17, 19],
and Poisson sampling [5, 30, 31].

Bottom-k sampling [46, 33, 44, 45, 47, 26] ITTl [19] has the
advantage (over Poisson and with-replacement sampling) of fixed
sample size and tighter estimators.

Multiple-set aggregates. The union-sketch reduction was used
with both k-mins and bottom-k sketches 11121 171 16ll2l 1251 and we
are not aware of a better estimator over k-mins sketches. The only
previous work we are aware of that leveraged combinations of
bottom-k sketches is [38], but they only derive ML estimators that
are biased and applicable only to WS bottom-k sketches. Multiple-set
aggregates over Poisson samples are approximated by producing a
Poisson sample of the union of the sets [30, 31]. Poisson
samples have the disadvantage of variable sample size. With our
combination estimators, coordinated bottom-k sketches dominate
other sampling-based sketching methods by providing both fixed
sample size and tighter estimators.

Figure 5. Ratio of the average relative error to that of the RC union estimator (inverse improvement factor) as a function of k. Applied to
sketches of destination IP addresses in $i \in \{2,3,4,5\}$ consecutive time periods (Top: campus data set, Bottom: peering data set). Left and Middle:
RC SCS and RC LCS estimate of the union of the first i time periods. Right: RC SCS estimate of the intersection of the first i periods.

Coordinated sketches that are not sample based. A strength of
sampling-based coordinated sketches is the generality of the
selection predicates combined with a tunable and potentially very
small summary size. Methods that are not sample-based include Bloom
filters 01 and variants [28], which have the drawback that summary
size grows linearly with the size of the corpus. Other methods,
such as Charikar's simhash [10], produce tunable small-size
summaries OU H2] ESI LH1 HS1 [23] [1U [37] [40l. These methods are
very effective for some tailored goals, such as pairwise similarity
measures between sets [34], but have inherent limitations: since
the summary does not retain keys' identifiers or meta-data, there
is no support for predicates with attribute-based conditions. For
example, in a market basket data set, where baskets are "keys"
and goods are "sets," we can estimate the association "purchase
of beer implies purchase of diapers" using the ratio of the number
of baskets with beer and diapers (size of the intersection) to
the number of baskets with beer. The more refined query where
the selection is restricted to consumer/basket segments (such as
"female consumer," "basket contains at most 12 goods," or "paid
in cash"), however, cannot be supported. Furthermore, only a
limited set of membership-based conditions is supported, and
inherently these methods do not provide a "representative sample"
of keys that satisfy the predicate.

A recent sampling/summarization scheme, varopt, minimizes
the sum of variances of sets of any fixed size [13]. We do not
know how to apply it to produce coordinated sketches.



This paper expands on a 6-page exposition [18] and on a
conference version [20].

11. Conclusion 

Sketches based on coordinated samples are a classic summa- 
rization method for datasets modeled as a collection of sets over a 
ground set of keys. The sketch of each set is a weighted sample of 
the keys with some auxiliary information. This powerful model 
covers a wide range of applications and sample-based sketches 
facilitate a much wider class of approximate queries than other 
sketching methods. 

We propose novel unbiased estimators for multiple-set weight
and selectivity aggregates over coordinated bottom-k sketches. Our
combination estimators outperform the existing union-sketch
estimators by using more samples present in the sketches of the sets
relevant to the query. We quantify the advantage of combination
estimators over union-sketch estimators through an extensive
empirical evaluation. Our evaluation suggests that combination
estimators applied when the combination has average size $\ell$ (has
$\ell$ distinct keys) perform comparably to estimators applied to a
size-$\ell$ union-sketch (derived from coordinated bottom-$\ell$
sketches). In particular, we can expect an $\ell/k$ factor reduction
in variance (a $\sqrt{\ell/k}$ factor reduction in estimation error)
for uniform weights (distinct values count) and a larger factor for
skewed distributions. The size $\ell$ of a combination lies in the
range $[k, t \cdot k]$, where $t$ is the number of relevant sets.
Combination size is larger when there are more sets, when sets have
fewer common keys, and when sets have homogeneous weights. Our
evaluation, which includes natural queries on real data sets,
demonstrates a typical 25% to 4-fold reduction in estimation error.

With our combination estimators, coordinated bottom-k sketches,
which have the advantage (over other sample-based coordinated
sketches) of fixed size and scalable algorithms, dominate other
coordinated sampling methods by also providing tighter estimators. We
therefore expect them to become the method of choice.

Figure 6. Top: Relative error of RC union, RC SCS, and RC LCS estimators on the size of the union of 2, 3, 4, and 5 sets of size 5000 each with
intersection of size 1000. Bottom: size ratios of combinations to k (left) and relative error ratios for LCS (middle) and SCS (right).